steveloughran commented on PR #6938: URL: https://github.com/apache/hadoop/pull/6938#issuecomment-2253149858
@shameersss1 I really don't know what best to do here. We have massively cut back on the number of retries which take place in the V2 SDK compared to V1; we have even discussed in the past turning SDK retries off completely and handling them all ourselves. However, that would break things the transfer manager does in separate threads.

The thing is, I do not know how often we see 500 errors against AWS S3 stores (rather than third-party ones with unrecoverable issues), and now that we have seen them I don't know what the right policy should be. The only documentation on what to do seems more focused on 503s, and doesn't provide any hints about why a 500 could happen or what to do other than "keep trying, maybe it'll go away": https://repost.aws/knowledge-center/http-5xx-errors-s3 . I do suspect it is very rare; otherwise the AWS team might have noticed their lack of resilience here, and we would've found it during our own testing. A 500 error at any point other than multipart uploads probably gets recovered from nicely, so there could have been a background level of these which we have never noticed before. s3a FS stats will now track these, which may be informative.

I don't want to introduce another configuration switch if possible, because that adds more documentation, testing, maintenance, etc. One thing I was considering: should we treat this exactly the same as a throttling exception, which has its own configuration settings for retries?

Anyway, if you could talk to your colleagues and make some suggestions based on real knowledge of what can happen, that would be really nice. Note that we are treating 500 as idempotent, the way we do with all the other failures, even though from a distributed-computing purism perspective it is not in fact true. Not looked at the other comments yet; will do later.
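To make the "treat 500 like throttling" option concrete, here is a minimal sketch of a retry loop that classifies 500 alongside 503 and applies bounded exponential backoff. All class and constant names here are hypothetical illustrations, not the actual S3A or AWS SDK retry classes, and the limits are placeholders for whatever the throttling settings would supply.

```java
import java.util.concurrent.Callable;

public class Http500RetrySketch {

    static final int MAX_ATTEMPTS = 5;      // hypothetical attempt limit
    static final long BASE_DELAY_MS = 100;  // hypothetical base backoff

    /** Hypothetical exception carrying an HTTP status code. */
    static class HttpStatusException extends RuntimeException {
        final int status;
        HttpStatusException(int status) { this.status = status; }
    }

    /** The idea under discussion: 500 retried the same way as 503. */
    static boolean isRetryable(int status) {
        return status == 500 || status == 503;
    }

    static <T> T withRetries(Callable<T> op) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (HttpStatusException e) {
                if (!isRetryable(e.status) || attempt >= MAX_ATTEMPTS) {
                    throw e;  // fail fast on non-retryable or exhausted
                }
                // exponential backoff: 100ms, 200ms, 400ms, ...
                Thread.sleep(BASE_DELAY_MS << (attempt - 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // simulated operation: fails twice with 500, then succeeds
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new HttpStatusException(500);
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point of sharing a policy with throttling is that both failure modes are "server-side, maybe transient", so one set of tuning knobs covers both; whether that matches real 500 behaviour is exactly the open question above.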
Based on a code walk-through with Mukud, Harshit and Saikat, I've realised we should make absolutely sure that the stream providing a subset of a file fails immediately if a read() goes past the allocated space. With tests, obviously.
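A minimal sketch of that subset-stream contract: the class name and range semantics below are illustrative assumptions, not the actual S3A stream implementation. The view stops at its allocated length, and fails fast with an `EOFException` if the underlying stream runs dry before the range is satisfied, rather than silently returning short data.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BoundedRangeStream extends InputStream {
    private final InputStream in;
    private final long length;   // bytes this view is allowed to serve
    private long position;       // bytes served so far

    public BoundedRangeStream(InputStream in, long length) {
        this.in = in;
        this.length = length;
    }

    @Override
    public int read() throws IOException {
        if (position >= length) {
            return -1;  // normal EOF: never read past the allocated range
        }
        int b = in.read();
        if (b < 0) {
            // underlying stream ended early: fail immediately
            throw new EOFException(
                "Premature end of stream at " + position + "/" + length);
        }
        position++;
        return b;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes();
        // view covering only the first 5 bytes
        BoundedRangeStream s =
            new BoundedRangeStream(new ByteArrayInputStream(data), 5);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = s.read()) >= 0) {
            sb.append((char) b);
        }
        System.out.println(sb);  // stops at the range boundary
    }
}
```

The tests would cover both boundaries: reads stop exactly at the allocated length, and a too-short source raises rather than truncates.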
