steveloughran commented on PR #6938: URL: https://github.com/apache/hadoop/pull/6938#issuecomment-2253149858
@shameersss1 I really don't know what best to do here. We have massively cut back on the number of retries which take place in the V2 SDK compared to V1; we have even discussed in the past turning SDK retries off completely and handling them all ourselves. However, that would break things the transfer manager does in separate threads.

The thing is, I do not know how often we see 500 errors against AWS S3 stores (rather than third-party ones with unrecoverable issues), and now that we have seen them I don't know what the right policy should be. The only documentation on what to do seems more focused on 503s, and doesn't provide any hints about why a 500 could happen or what to do other than "keep trying, maybe it'll go away": https://repost.aws/knowledge-center/http-5xx-errors-s3 . I do suspect it is very rare; otherwise the AWS team might have noticed their lack of resilience here, and we would've found it during our own testing. A 500 error at any point other than multipart uploads probably gets recovered from nicely, so there could have been a background level of these which we have never noticed before. s3a FS stats will now track these, which may be informative.

I don't want to introduce another configuration switch if possible, because that adds more documentation, testing, maintenance, etc. One thing I was considering: should we treat this exactly the same as a throttling exception, which has its own configuration settings for retries?

Anyway, if you could talk to your colleagues and make some suggestions based on real knowledge of what can happen, that would be really nice. Note that we are treating 500 as idempotent, the way we do with all the other failures, even though from a distributed-computing purism perspective it is not in fact true. Not looked at the other comments yet; will do later.
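To make the "treat 500 like throttling" option concrete, here is a minimal sketch of a retry loop that classifies 500 alongside 503 and applies bounded exponential backoff. All class and constant names here are hypothetical illustrations, not the actual S3A or AWS SDK retry classes, and the limits are placeholders for whatever the throttling settings would supply.

```java
import java.util.concurrent.Callable;

public class Http500RetrySketch {

    static final int MAX_ATTEMPTS = 5;      // hypothetical attempt limit
    static final long BASE_DELAY_MS = 100;  // hypothetical base backoff

    /** Hypothetical exception carrying an HTTP status code. */
    static class HttpStatusException extends RuntimeException {
        final int status;
        HttpStatusException(int status) { this.status = status; }
    }

    /** The idea under discussion: 500 retried the same way as 503. */
    static boolean isRetryable(int status) {
        return status == 500 || status == 503;
    }

    static <T> T withRetries(Callable<T> op) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (HttpStatusException e) {
                if (!isRetryable(e.status) || attempt >= MAX_ATTEMPTS) {
                    throw e;  // fail fast on non-retryable or exhausted
                }
                // exponential backoff: 100ms, 200ms, 400ms, ...
                Thread.sleep(BASE_DELAY_MS << (attempt - 1));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // simulated operation: fails twice with 500, then succeeds
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new HttpStatusException(500);
            return "ok";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The point of sharing a policy with throttling is that both failure modes are "server-side, maybe transient", so one set of tuning knobs covers both; whether that matches real 500 behaviour is exactly the open question above.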
Based on a code walk-through with Mukud, Harshit and Saikat, I've realised we should make absolutely sure that the stream providing a subset of a file fails immediately if a read() goes past the allocated space. With tests, obviously.
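A minimal sketch of that subset-stream contract: the class name and range semantics below are illustrative assumptions, not the actual S3A stream implementation. The view stops at its allocated length, and fails fast with an `EOFException` if the underlying stream runs dry before the range is satisfied, rather than silently returning short data.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BoundedRangeStream extends InputStream {
    private final InputStream in;
    private final long length;   // bytes this view is allowed to serve
    private long position;       // bytes served so far

    public BoundedRangeStream(InputStream in, long length) {
        this.in = in;
        this.length = length;
    }

    @Override
    public int read() throws IOException {
        if (position >= length) {
            return -1;  // normal EOF: never read past the allocated range
        }
        int b = in.read();
        if (b < 0) {
            // underlying stream ended early: fail immediately
            throw new EOFException(
                "Premature end of stream at " + position + "/" + length);
        }
        position++;
        return b;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes();
        // view covering only the first 5 bytes
        BoundedRangeStream s =
            new BoundedRangeStream(new ByteArrayInputStream(data), 5);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = s.read()) >= 0) {
            sb.append((char) b);
        }
        System.out.println(sb);  // stops at the range boundary
    }
}
```

The tests would cover both boundaries: reads stop exactly at the allocated length, and a too-short source raises rather than truncates.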
