wombatu-kun commented on issue #16659:
URL: https://github.com/apache/iceberg/issues/16659#issuecomment-4599862925

   Hi @martinskeem, thanks for the detailed write-up and stack trace.
   
   **1. Is 409 the spec'd status for commit conflicts?**
   
   Yes. The REST catalog OpenAPI spec models the commit route (`updateTable`) 
so that a requirement/assertion failure is a 409, not a 400:
   
   - `409` -> "Conflict - CommitFailedException, one or more requirements 
failed. The client may retry."
   - `400` -> `BadRequestErrorResponse`, and the operation description states: 
"Server implementations are required to fail with a 400 status code if any 
unknown updates or requirements are received."
   
   So the spec reserves 400 for a malformed commit (unknown 
updates/requirements), while a genuine concurrent-write conflict - "one or more 
requirements failed" - is a 409 that the client may retry. A catalog returning 
400 for a concurrent-write conflict conflates those two cases, which looks like 
a server-side compliance gap. That citation (in 
`open-api/rest-catalog-open-api.yaml`, the `updateTable` responses) should be 
useful for a report to Databricks.
   
   **2. Why it wedges the connector (root cause)**
   
   The retry is keyed on exception type, not status code:
   
   - `ErrorHandlers.CommitErrorHandler` maps `409 -> CommitFailedException` and 
lets `400` fall through to `DefaultErrorHandler`, which raises 
`BadRequestException`.
   - `SnapshotProducer.commit` retries with 
`.onlyRetryOn(CommitFailedException.class)`.
   
   So a 400 becomes a terminal `BadRequestException` that escapes the 
retry-with-refresh loop entirely, and the coordinator can't make progress - 
exactly what you're seeing.
   
   **3. On the workaround in #16644**
   
   The mechanism is sound - remapping to `CommitFailedException` is what 
restores the optimistic-concurrency retry. A few things worth weighing on the 
approach (I'm a contributor here, not a maintainer, so whether core should 
carry this is their call):
   
   - Matching on message text (`contains "commit"` + `"validation failed"`) is 
coupled to Databricks' current wording. If they reword the message it silently 
stops working, and it won't help other non-compliant catalogs that phrase it 
differently.
   - It changes default behavior for every REST catalog user. A genuinely 
malformed 400 whose message happens to contain those tokens would be 
reclassified as a retryable conflict and retried up to 
`commit.retry.num-retries` times, which can mask real client errors.
   - There are no tests yet; `TestErrorHandlers` has no `CommitErrorHandler` 
coverage, so it would be good to pin both the match and the fall-through.
   
   Possible directions (maintainers' call):
   
   - Cleanest: treat this as a Databricks compliance bug and file it - the spec 
citation above is the basis. That keeps core spec-compliant.
   - If a client-side mitigation is still wanted, an opt-in catalog property 
(default off) that remaps 400 conflict responses to `CommitFailedException` 
would keep spec-compliant behavior as the default rather than applying an 
unconditional heuristic to everyone. Worth pairing with tests.
   - Either way it is a behavior change in engine-agnostic `core/`, so a quick 
note on the dev@ list (or tagging REST-catalog maintainers) before investing 
further is probably wise.
   
   Happy to help with tests or a config-gated variant if you decide to take it 
forward.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to