ChristinaTech commented on issue #7151:
URL: https://github.com/apache/iceberg/issues/7151#issuecomment-1477096812

   As for solutions, here are options I have considered:
   ##### Option 1: Never delete the metadata file on a failed commit
   While the simplest solution, generally speaking there ARE cases where we 
know we can clean this up.
   
   ##### Option 2: Disable all retries for Glue and DynamoDB clients
   This hurts reliability in normal usage to an unacceptable degree.
   
   ##### Option 3: Disable AWS SDK retries just for the vulnerable calls and 
perform retries via Iceberg Tasks class
   While this mechanism can decide whether to check the actual commit status on 
a case-by-case basis, I have 2 concerns about this:
   * It moves retry logic currently handled by the AWS SDK into Iceberg, which 
could cause problems for any customers providing custom AWS clients or 
customers not expecting the changed retry behavior.
   * Every mechanism I can come up with to disable retries for specific calls 
is fragile and couples it to the SDK implementation and specific provides AWS 
Client factory.
   
   ##### Option 4: Detect AWS SDK retries and check commit status if they 
occurred
   I looked into this a bit and the available ways to do this all have issues:
   * Checking the thrown exception for suppressed SdkExceptions attached to it. 
This works but as these suppressed exceptions are not a documented part of the 
SDK they are vulnerable to breaking in the future.
   * Attaching a metrics collector to the request specifically to grab the 
`CoreMetrics.RETRY_COUNT` value. This is publicly documented but means 
branching on a metric, which just feels odd. It also requires a side-channel to 
make this information available to the exception handler. Despite that, its 
probably the solution least likely to break.
   * Modify the AWS SDK clients factory to provide retry count via a 
side-channel. This breaks if a modified AWS client factory is being provided by 
the end-user.
   
   ##### Option 5: Always check commit status if the thrown exception is an AWS 
SDK exception
   This avoids all the retry introspection of the above approach, but means we 
end up spending commit time/latency checking the commit status when we might 
not need to. Its still preferable to risking corrupting the table state though.
   
   ##### Current Plans
   Of these, Option 2 is easy to rule out for the reliability hit, and while 
Option 1 is the fastest and simplest its the most likely to leave orphan files 
behind. For the time being I am working on a pull request based on Option 5, as 
its a good starting point for further work if we go with Option 4 as well.  We 
might decide to go with Option 3 (or an option I haven't thought of) instead, 
but my priority is to get a viable fix available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to