wombatu-kun commented on issue #16659: URL: https://github.com/apache/iceberg/issues/16659#issuecomment-4599862925
Hi @martinskeem, thanks for the detailed write-up and stack trace. **1. Is 409 the spec'd status for commit conflicts?** Yes. The REST catalog OpenAPI spec models the commit route (`updateTable`) so that a requirement/assertion failure is a 409, not a 400: - `409` -> "Conflict - CommitFailedException, one or more requirements failed. The client may retry." - `400` -> `BadRequestErrorResponse`, and the operation description states: "Server implementations are required to fail with a 400 status code if any unknown updates or requirements are received." So the spec reserves 400 for a malformed commit (unknown updates/requirements), while a genuine concurrent-write conflict - "one or more requirements failed" - is a 409 that the client may retry. A catalog returning 400 for a concurrent-write conflict conflates those two cases, which looks like a server-side compliance gap. That citation (in `open-api/rest-catalog-open-api.yaml`, the `updateTable` responses) should be useful for a report to Databricks. **2. Why it wedges the connector (root cause)** The retry is keyed on exception type, not status code: - `ErrorHandlers.CommitErrorHandler` maps `409 -> CommitFailedException` and lets `400` fall through to `DefaultErrorHandler`, which raises `BadRequestException`. - `SnapshotProducer.commit` retries with `.onlyRetryOn(CommitFailedException.class)`. So a 400 becomes a terminal `BadRequestException` that escapes the retry-with-refresh loop entirely, and the coordinator can't make progress - exactly what you're seeing. **3. On the workaround in #16644** The mechanism is sound - remapping to `CommitFailedException` is what restores the optimistic-concurrency retry. A few things worth weighing on the approach (I'm a contributor here, not a maintainer, so whether core should carry this is their call): - Matching on message text (`contains "commit"` + `"validation failed"`) is coupled to Databricks' current wording. If they reword the message it silently stops working, and it won't help other non-compliant catalogs that phrase it differently. - It changes default behavior for every REST catalog user. A genuinely malformed 400 whose message happens to contain those tokens would be reclassified as a retryable conflict and retried up to `commit.retry.num-retries` times, which can mask real client errors. - There are no tests yet; `TestErrorHandlers` has no `CommitErrorHandler` coverage, so it would be good to pin both the match and the fall-through. Possible directions (maintainers' call): - Cleanest: treat this as a Databricks compliance bug and file it - the spec citation above is the basis. That keeps core spec-compliant. - If a client-side mitigation is still wanted, an opt-in catalog property (default off) that remaps 400 conflict responses to `CommitFailedException` would keep spec-compliant behavior as the default rather than applying an unconditional heuristic to everyone. Worth pairing with tests. - Either way it is a behavior change in engine-agnostic `core/`, so a quick note on the dev@ list (or tagging REST-catalog maintainers) before investing further is probably wise. Happy to help with tests or a config-gated variant if you decide to take it forward. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
