mustafaiman commented on issue #1795: HADOOP-16792: Make S3 client request timeout configurable
URL: https://github.com/apache/hadoop/pull/1795#issuecomment-577373810

@steveloughran I ran `ITestS3AHugeFilesDiskBlocks#test_010_CreateHugeFile` with several combinations.

The first experiments used the default file size and partition size for huge files, with the request timeout set to 1 ms. The test file system failed to initialize, because the verifyBuckets call at the beginning times out repeatedly. That call is retried within the AWS SDK up to `com.amazonaws.ClientConfiguration#maxErrorRetry` times; this value is configurable from the Hadoop side via the property `fs.s3a.attempts.maximum`. All of these retries are opaque to Hadoop. At the end of the retry cycle, the AWS SDK returns the failure to Hadoop's Invoker, which then evaluates whether to retry the operation according to its configured retry policies. I saw that the verifyBuckets call was not retried at the Invoker level.

In a follow-up experiment, I set the request timeout to 200 ms, which is long enough for the verifyBuckets call to succeed but short enough that multipart uploads fail. In these cases, again, the AWS SDK retries the HTTP requests up to `maxErrorRetry` times. Once an HTTP request has failed `maxErrorRetry` times, the Invoker's retry mechanism kicks in: I observed the Invoker retrying these operations up to `fs.s3a.retry.limit` times, conforming to the configured exponential back-off retry policy. After all these `fs.s3a.retry.limit` * `maxErrorRetry` attempts, the Invoker bubbles up an AWSClientIOException to the user, as shown below:

```
org.apache.hadoop.fs.s3a.AWSClientIOException: upload part on tests3ascale/disk/hugefile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Request did not complete before the request timeout configuration.: Unable to execute HTTP request: Request did not complete before the request timeout configuration.
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:205)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:112)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:315)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:407)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:311)
```

Later, I ran the test with a 256M file size and a 32M partition size, with the request timeout set to 5 s. My goal was to provoke a few retries via the short request timeout but still complete the upload through those retries, and that is what happened: some requests timed out, were retried, and the upload completed successfully. The test still failed, however, because it also expects `TRANSFER_PART_FAILED_EVENT` to be 0. That is obviously not the case here, since some transfers failed and were then retried. I checked S3 and verified that the file was there, and also verified that the temporary partition files were cleaned up on my local drive.

When I ran the same experiment with an 8GB file and 128M partitions but a small request timeout, the test failed because the uploads could not complete. I also ran a soak test with 8GB files and a large request timeout; this passed fine, as expected, because the timeout value was high enough to let the uploads complete.
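For context, here is a minimal sketch of how a request timeout could be wired from a Hadoop property into the AWS `ClientConfiguration`. The property name and helper method below are illustrative assumptions, not necessarily the exact code in this PR; `ClientConfiguration#setRequestTimeout` is the AWS SDK v1 hook the feature relies on.

```java
import com.amazonaws.ClientConfiguration;
import org.apache.hadoop.conf.Configuration;

public final class RequestTimeoutSketch {

  // Hypothetical property name for illustration; 0 keeps the SDK default
  // behavior (no per-request timeout).
  static final String REQUEST_TIMEOUT = "fs.s3a.connection.request.timeout";
  static final int DEFAULT_REQUEST_TIMEOUT = 0;

  static void initRequestTimeout(Configuration conf, ClientConfiguration awsConf) {
    int timeoutMillis = conf.getInt(REQUEST_TIMEOUT, DEFAULT_REQUEST_TIMEOUT);
    if (timeoutMillis > 0) {
      // Any HTTP attempt exceeding this is aborted by the SDK and retried
      // internally up to ClientConfiguration#getMaxErrorRetry() times before
      // the failure ever surfaces to S3A's Invoker.
      awsConf.setRequestTimeout(timeoutMillis);
    }
  }

  private RequestTimeoutSketch() {
  }
}
```

Note how the two retry layers multiply: the SDK's internal retries happen on every attempt the Invoker makes, which is why the stack trace above only appears after `fs.s3a.retry.limit` * `maxErrorRetry` failed attempts.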
@bgaborg I did not add a functional test here; I am repeating what we discussed offline to leave a record of the reasoning. The retry mechanism lives entirely within the AWS SDK, as explained earlier in this comment. To introduce a functional test, we would need a way to selectively delay or fail some requests, because we want file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws has no visibility into that mechanism. So we depend on the AWS SDK to maintain this feature, for which it exposes a configuration option. I think this is a very reasonable assumption: if the AWS SDK drops support for the feature in the future, the configuration test I added here will fail anyway, because I expect the SDK either to throw an exception when an unsupported option is set on the ClientConfiguration object, or to remove the method entirely, which would cause a compile error (see the sketch below).

@steveloughran @bgaborg I addressed the other review comments in the code. This final version has passed the hadoop-aws test suite against us-west-1.
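For the record, the shape of that configuration test is roughly the following self-contained sketch. The property name is the same illustrative assumption as above, and the class name is hypothetical; the assertion relies only on `ClientConfiguration#setRequestTimeout` / `getRequestTimeout` from the AWS SDK, so if the SDK dropped the feature, this would break at compile time.

```java
import static org.junit.Assert.assertEquals;

import com.amazonaws.ClientConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class TestRequestTimeoutSketch {

  // Hypothetical property name, matching the earlier sketch.
  private static final String REQUEST_TIMEOUT = "fs.s3a.connection.request.timeout";

  @Test
  public void testRequestTimeoutPropagatesToClientConfiguration() {
    Configuration conf = new Configuration(false);
    conf.setInt(REQUEST_TIMEOUT, 120_000); // two minutes, in milliseconds

    // The same wiring as the earlier sketch, inlined here so the test is
    // self-contained.
    ClientConfiguration awsConf = new ClientConfiguration();
    int timeoutMillis = conf.getInt(REQUEST_TIMEOUT, 0);
    if (timeoutMillis > 0) {
      awsConf.setRequestTimeout(timeoutMillis);
    }

    // If the SDK ever removed request-timeout support, setRequestTimeout /
    // getRequestTimeout would disappear and this would fail to compile,
    // which is the safety net described above.
    assertEquals(120_000, awsConf.getRequestTimeout());
  }
}
```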
