moomindani opened a new issue, #16661:
URL: https://github.com/apache/iceberg/issues/16661

   **Query engine**: Other (engine-agnostic, `iceberg-core`)
   **Willingness to contribute**: I can contribute this improvement/feature 
independently
   
   ## Feature Request / Improvement
   
   ### Motivation
   
   Today a `MetricsReporter` only ever observes **successful** operations:
   
   - `ScanReport` is emitted from `SnapshotScan#planFiles`; a planning failure 
(synchronous or surfaced lazily during scan-file iteration) is not reported as 
a failure.
   - `CommitReport` is emitted from `SnapshotProducer#notifyListeners` only on 
a `CreateSnapshotEvent`, i.e. **after a commit succeeds**. A failed commit 
(`CommitFailedException` after retries, `ValidationException` from conflict 
checks, `CommitStateUnknownException`) emits nothing.
   
   So operators wiring a `MetricsReporter` (or `RESTMetricsReporter`) cannot 
observe failure rate, retry/backoff behavior, or conflict frequency — common 
production concerns (e.g. [AWS's write-conflict guide for Iceberg on 
Glue](https://aws.amazon.com/blogs/big-data/manage-concurrent-write-conflicts-in-apache-iceberg-on-the-aws-glue-data-catalog/)).
   
   ### Proposal
   
   Emit a report on failure from `SnapshotProducer#commit` (covering 
`CommitFailedException`, `ValidationException`, `CommitStateUnknownException`) 
and from `SnapshotScan#planFiles`, best-effort so reporting never masks the 
original failure. A prototype exists (iceberg-core only) with unit tests that 
assert a failed commit and a failed scan each produce a failure report; the 
tests were confirmed to fail without the emission change; existing 
reporting/REST tests stay green.
   
   ### Design decision (needs input)
   
   **Decision 1 — separate report type vs. reuse existing reports.**
   
   | | New `CommitFailureReport`/`ScanFailureReport` (chosen) | Reuse 
`CommitReport`/`ScanReport` + a failure marker |
   |---|---|---|
   | Backward compat | Existing `instanceof CommitReport` aggregations are 
unaffected by construction | Existing consumers silently start mixing failures 
into success metrics unless they add a filter |
   | Opt-in flag | Not needed (non-disruptive) | Likely needed to avoid 
surprising existing reporters |
   | Type safety / discoverability | High (typed) | Lower (a flag/metadata key 
consumers must know about) |
   | API surface | Adds new public types (additive; revapi-safe) | No new types 
|
   
   **Decision 2 — deliver failures over REST, or in-process only.**
   
   | | In-process only (chosen) | Also POST to the catalog |
   |---|---|---|
   | REST spec | No change; `RESTMetricsReporter` just skips failure types | 
New `*FailureReport` schemas + `report-type` values + server support + 
wire-compat concerns |
   | Reach | Client-side reporters (logging, OpenTelemetry, custom) | Catalog 
also sees failures |
   
   **Prototype choice: (1) separate failure report types, (2) in-process only** 
(no REST-schema change). Does this match what the project wants, or would you 
prefer reusing the existing reports / delivering failures over REST?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to