moomindani opened a new issue, #16661: URL: https://github.com/apache/iceberg/issues/16661
**Query engine**: Other (engine-agnostic, `iceberg-core`) **Willingness to contribute**: I can contribute this improvement/feature independently ## Feature Request / Improvement ### Motivation Today a `MetricsReporter` only ever observes **successful** operations: - `ScanReport` is emitted from `SnapshotScan#planFiles`; a planning failure (synchronous or surfaced lazily during scan-file iteration) is not reported as a failure. - `CommitReport` is emitted from `SnapshotProducer#notifyListeners` only on a `CreateSnapshotEvent`, i.e. **after a commit succeeds**. A failed commit (`CommitFailedException` after retries, `ValidationException` from conflict checks, `CommitStateUnknownException`) emits nothing. So operators wiring a `MetricsReporter` (or `RESTMetricsReporter`) cannot observe failure rate, retry/backoff behavior, or conflict frequency — common production concerns (e.g. [AWS's write-conflict guide for Iceberg on Glue](https://aws.amazon.com/blogs/big-data/manage-concurrent-write-conflicts-in-apache-iceberg-on-the-aws-glue-data-catalog/)). ### Proposal Emit a report on failure from `SnapshotProducer#commit` (covering `CommitFailedException`, `ValidationException`, `CommitStateUnknownException`) and from `SnapshotScan#planFiles`, best-effort so reporting never masks the original failure. A prototype exists (iceberg-core only) with unit tests that assert a failed commit and a failed scan each produce a failure report; the tests were confirmed to fail without the emission change; existing reporting/REST tests stay green. ### Design decision (needs input) **Decision 1 — separate report type vs. reuse existing reports.** | | New `CommitFailureReport`/`ScanFailureReport` (chosen) | Reuse `CommitReport`/`ScanReport` + a failure marker | |---|---|---| | Backward compat | Existing `instanceof CommitReport` aggregations are unaffected by construction | Existing consumers silently start mixing failures into success metrics unless they add a filter | | Opt-in flag | Not needed (non-disruptive) | Likely needed to avoid surprising existing reporters | | Type safety / discoverability | High (typed) | Lower (a flag/metadata key consumers must know about) | | API surface | Adds new public types (additive; revapi-safe) | No new types | **Decision 2 — deliver failures over REST, or in-process only.** | | In-process only (chosen) | Also POST to the catalog | |---|---|---| | REST spec | No change; `RESTMetricsReporter` just skips failure types | New `*FailureReport` schemas + `report-type` values + server support + wire-compat concerns | | Reach | Client-side reporters (logging, OpenTelemetry, custom) | Catalog also sees failures | **Prototype choice: (1) separate failure report types, (2) in-process only** (no REST-schema change). Does this match what the project wants, or would you prefer reusing the existing reports / delivering failures over REST? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
