I would agree with making 503 non-retryable by default. Though I don't necessarily agree with the spec change in the PR. I feel like the only thing that needs to be updated is the component definition in the spec to say that clients may only retry if a `Retry-After` header is present. Otherwise, they should treat it as non-retryable.
I believe we only included that language due to services like S3 using "503 Service Unavailable" as opposed to the more appropriate "429 Too Many Requests" for throttling, but I don't think 503 is a good practice to propagate for this scenario. We don't call out 429 in the spec, but most clients support this by default and I believe the reference implementation does as well. -Dan On Mon, Aug 11, 2025 at 5:26 PM huaxin gao <huaxin.ga...@gmail.com> wrote: > I'm in favor of marking 503 as non-retryable for updateTable to avoid > possible data corruption. Even though 503 can sometimes mean throttling, I > think treating it as CommitStateUnknown is the safer choice for commit > operations. We can look into better handling for throttling cases, like > using 429, in the future. > > On Tue, Aug 5, 2025 at 11:00 AM Steve Loughran <ste...@cloudera.com.invalid> > wrote: > >> 503 is throttling on many (but not all) of the AWS services (s3, >> lambda, redshift..) - though not dynamodb or others that return 400 + some >> message. >> >> If aws were to deploy an endpoint, they'd have to decide what to do. or >> even better, the spec could say "there is a specific error code for >> throttling, "429: Too Many Requests" to be returned in such a situation. >> >> On Sat, 2 Aug 2025 at 03:32, Dennis Huo <huoi...@gmail.com> wrote: >> >>> PR LGTM, thanks for following up on this! >>> >>> Reiterating my comments in the PR, IMO especially now that Russell's >>> https://github.com/apache/iceberg/pull/13449 is merged so that all the >>> read-only methods can still perform low-level retries, this change makes >>> sense as a safer default for commit behaviors. >>> >>> I could imagine a few potential follow-up features if people want to >>> achieve better availability in situations where more out-of-band knowledge >>> of specific server behaviors can be used on a case-by-case basis: >>> >>> 1. If anyone feels very strongly that in a given service environment >>> some particular 5xx code really is safely retriable without reconciliation, >>> we could consider extracting a runtime config that lists the non-retriable >>> HTTP codes instead of only having the hard-coded switch statement coming at >>> the cost of some code complexity >>> 2. We could make it easier to inject a >>> dynamically-classloaded HttpRequestRetryStrategy in the HttpClient ( >>> https://github.com/apache/iceberg/blob/a8fdb23682a7c083ea6ff9873f1531dd9d465aa7/core/src/main/java/org/apache/iceberg/rest/HTTPClient.java#L127) >>> - it seems there's already some precedent for injectable helpers here such >>> as rest.client.tls.configurer-impl -- custom impls could either be as >>> simple as using a different set of 5xx non-retriable codes, or as complex >>> as cooperating with other customized client-side logic to use an end-to-end >>> idempotency request-uuid to distinguish safe retries >>> 3. We could consider letting other response headers such as >>> "Retry-After" ( >>> https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After) >>> be a heuristic to interpret the *intent* of the server to communicate a >>> "graceful unavailability" that is safe to retry, vs an unexpected failure >>> where state is unknown >>> >>> In any case though, it does seem prudent to start with the safe default. >>> >>> On Wed, Jul 23, 2025 at 1:22 PM Prashant Singh <prashant010...@gmail.com> >>> wrote: >>> >>>> Hey All, >>>> Presently we treat 503 status code in spec here <http:///> as if >>>> the request was not tried at all. While doing 1.9.2 it was brought up that >>>> this exception might not be true and some of the very common services like >>>> Istio and Envoy were brought up as examples. >>>> >>>> As a result I wanted to formally bring up disabling retries on 503 on >>>> updateTable and treating that as CommitStateUnknown similar to what we do >>>> for 500, 502, 504. The rationale for this being 503 doesn't necessarily >>>> mean the request was not tried at all and systems like ENVOY can return 503 >>>> in the middle of processing. >>>> >>>> Adding detailed analysis of my colleague Dennis here : >>>> https://lists.apache.org/thread/h8xowcntomh7r5woz24cv27v3o8r9wgd >>>> >>>> It seemed like most people participating in the thread above agreed, >>>> starting this thread to see if there are objections to it in a wider forum. >>>> >>>> Meanwhile, I have filed a PR for the same >>>> [1] https://github.com/apache/iceberg/pull/13619 >>>> >>>> Please let us know your thoughts and provide your feedback ! >>>> >>>> Best, >>>> Prashant Singh >>>> >>>> >>>> >>>>