mustafaiman commented on issue #1795: HADOOP-16792: Make S3 client request timeout configurable
URL: https://github.com/apache/hadoop/pull/1795#issuecomment-577373810

@steveloughran I ran `ITestS3AHugeFilesDiskBlocks#test_010_CreateHugeFile` with several combinations.

The first experiments used the default file size and partition size for huge files, with the request timeout set to 1 ms. The test file system failed to initialize, because the verifyBuckets call at the beginning times out repeatedly. That call is retried within the AWS SDK up to `com.amazonaws.ClientConfiguration#maxErrorRetry` times; this value is configurable from the Hadoop side via the property `fs.s3a.attempts.maximum`. All of these retries are opaque to Hadoop. At the end of the retry cycle, the AWS SDK returns the failure to Hadoop's Invoker, which then evaluates whether to retry the operation according to its configured retry policies. I saw that the verifyBuckets call was not retried at the Invoker level.

In a follow-up experiment, I set the request timeout to 200 ms, which is long enough for the verifyBuckets call to succeed but short enough that multipart uploads fail. In these cases, again, the AWS SDK retries the HTTP requests up to `maxErrorRetry` times. Once an HTTP request has failed `maxErrorRetry` times, the Invoker's retry mechanism kicks in: I observed the Invoker retrying these operations up to `fs.s3a.retry.limit` times, conforming to the configured exponential back-off retry policy. After all these `fs.s3a.retry.limit` * `maxErrorRetry` attempts, the Invoker bubbles up an AWSClientIOException to the user, as shown below:

```
org.apache.hadoop.fs.s3a.AWSClientIOException: upload part on tests3ascale/disk/hugefile: com.amazonaws.SdkClientException: Unable to execute HTTP request: Request did not complete before the request timeout configuration.: Unable to execute HTTP request: Request did not complete before the request timeout configuration.
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:205)
	at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:112)
	at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:315)
	at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:407)
	at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:311)
```

Later, I ran the test with a 256M file size and a 32M partition size, with the request timeout set to 5 s. My goal was to provoke a few retries via the short request timeout but still complete the upload through those retries, and that is what happened: some requests timed out, were retried, and the upload completed successfully. The test still failed, however, because it also expects `TRANSFER_PART_FAILED_EVENT` to be 0. That is obviously not the case here, since some transfers failed and were then retried. I checked S3 and verified that the file was there, and also verified that the temporary partition files were cleaned up on my local drive.

When I ran the same experiment with an 8GB file and 128M partitions but a small request timeout, the test failed because the uploads could not complete. I also ran a soak test with 8GB files and a large request timeout; this passed fine, as expected, because the timeout value was high enough to let the uploads complete.
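For context, here is a minimal sketch of how a request timeout could be wired from a Hadoop property into the AWS `ClientConfiguration`. The property name and helper method below are illustrative assumptions, not necessarily the exact code in this PR; `ClientConfiguration#setRequestTimeout` is the AWS SDK v1 hook the feature relies on.

```java
import com.amazonaws.ClientConfiguration;
import org.apache.hadoop.conf.Configuration;

public final class RequestTimeoutSketch {

  // Hypothetical property name for illustration; 0 keeps the SDK default
  // behavior (no per-request timeout).
  static final String REQUEST_TIMEOUT = "fs.s3a.connection.request.timeout";
  static final int DEFAULT_REQUEST_TIMEOUT = 0;

  static void initRequestTimeout(Configuration conf, ClientConfiguration awsConf) {
    int timeoutMillis = conf.getInt(REQUEST_TIMEOUT, DEFAULT_REQUEST_TIMEOUT);
    if (timeoutMillis > 0) {
      // Any HTTP attempt exceeding this is aborted by the SDK and retried
      // internally up to ClientConfiguration#getMaxErrorRetry() times before
      // the failure ever surfaces to S3A's Invoker.
      awsConf.setRequestTimeout(timeoutMillis);
    }
  }

  private RequestTimeoutSketch() {
  }
}
```

Note how the two retry layers multiply: the SDK's internal retries happen on every attempt the Invoker makes, which is why the stack trace above only appears after `fs.s3a.retry.limit` * `maxErrorRetry` failed attempts.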
@bgaborg I did not add a functional test here; I am repeating what we discussed offline to leave a record of the reasoning. The retry mechanism lives entirely within the AWS SDK, as explained earlier in this comment. To introduce a functional test, we would need a way to selectively delay or fail some requests, because we want file system initialization to succeed but a subsequent dummy operation (like getFileStatus) to be delayed. Introducing such test support is very hard, if not impossible, since hadoop-aws has no visibility into that mechanism. So we depend on the AWS SDK to maintain this feature, for which it exposes a configuration option. I think this is a very reasonable assumption: if the AWS SDK drops support for the feature in the future, the configuration test I added here will fail anyway, because I expect the SDK either to throw an exception when an unsupported option is set on the ClientConfiguration object, or to remove the method entirely, which would cause a compile error (see the sketch below).

@steveloughran @bgaborg I addressed the other review comments in the code. This final version has passed the hadoop-aws test suite against us-west-1.
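For the record, the shape of that configuration test is roughly the following self-contained sketch. The property name is the same illustrative assumption as above, and the class name is hypothetical; the assertion relies only on `ClientConfiguration#setRequestTimeout` / `getRequestTimeout` from the AWS SDK, so if the SDK dropped the feature, this would break at compile time.

```java
import static org.junit.Assert.assertEquals;

import com.amazonaws.ClientConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.junit.Test;

public class TestRequestTimeoutSketch {

  // Hypothetical property name, matching the earlier sketch.
  private static final String REQUEST_TIMEOUT = "fs.s3a.connection.request.timeout";

  @Test
  public void testRequestTimeoutPropagatesToClientConfiguration() {
    Configuration conf = new Configuration(false);
    conf.setInt(REQUEST_TIMEOUT, 120_000); // two minutes, in milliseconds

    // The same wiring as the earlier sketch, inlined here so the test is
    // self-contained.
    ClientConfiguration awsConf = new ClientConfiguration();
    int timeoutMillis = conf.getInt(REQUEST_TIMEOUT, 0);
    if (timeoutMillis > 0) {
      awsConf.setRequestTimeout(timeoutMillis);
    }

    // If the SDK ever removed request-timeout support, setRequestTimeout /
    // getRequestTimeout would disappear and this would fail to compile,
    // which is the safety net described above.
    assertEquals(120_000, awsConf.getRequestTimeout());
  }
}
```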
