[ https://issues.apache.org/jira/browse/HADOOP-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881668#comment-17881668 ]

ASF GitHub Bot commented on HADOOP-19221:
-----------------------------------------

steveloughran opened a new pull request, #7044:
URL: https://github.com/apache/hadoop/pull/7044

   
   This is a major change which handles the failure mode where a large file is 
being uploaded from a memory heap/buffer (or by the staging committer), the 
remote S3 store returns a 500 response to the upload of a block in a multipart 
upload, and subsequent retries then fail with 400 error responses.
   
   The SDK's own streaming code seems unable to fully replay the upload; it 
attempts to, but then blocks, and the S3 store returns a 400 response:
   
       "Your socket connection to the server was not read from or written to
        within the timeout period. Idle connections will be closed.
        (Service: S3, Status Code: 400...)"
   
   There is an option to control whether or not the S3A client itself attempts 
to retry on 5xx errors other than 503 throttling events (which are 
processed independently, as before).
   
   Option:  fs.s3a.retry.http.5xx.errors
   Default: true
   
   500 errors are very rare from standard AWS S3, which has a five nines SLA. 
They may be more common against S3 Express, which has lower guarantees.
   
   Third party stores have unknown guarantees, and the exception may indicate a 
bad server configuration. Consider setting fs.s3a.retry.http.5xx.errors to 
false when working with such stores.
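   A minimal sketch of setting the option programmatically (the bucket URI and 
class name below are illustrative, not part of the patch; the same effect is 
achieved by setting the property in the cluster configuration):
   
```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DisableS3A5xxRetries {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 503 throttling responses are still retried as before; this only
    // disables the S3A client's own retries of other 5xx responses.
    conf.setBoolean("fs.s3a.retry.http.5xx.errors", false);
    try (FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf)) {
      System.out.println("5xx retries disabled for " + fs.getUri());
    }
  }
}
```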
   
   Significant code changes:
   
   There is now a custom set of implementations of 
software.amazon.awssdk.http.ContentStreamProvider in the class 
org.apache.hadoop.fs.s3a.impl.UploadContentProviders.
   
   These:
   
   * Restart on failures.
   * Do not copy buffers/byte buffers into new private byte arrays, so they 
avoid exacerbating memory problems (see the sketch after this list).
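   
   The core idea, sketched below under the assumption of a single byte-array 
block (this class is illustrative only, not the actual code in 
UploadContentProviders): every call to newStream() hands the SDK a fresh 
stream over the same backing data, so a failed part upload can be replayed 
without the block ever being copied.
   
```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import software.amazon.awssdk.http.ContentStreamProvider;

/** Illustrative provider: replayable, and no copy of the block is taken. */
public class ByteArrayContentProvider implements ContentStreamProvider {
  private final byte[] block;
  private final int offset;
  private final int length;

  public ByteArrayContentProvider(byte[] block, int offset, int length) {
    this.block = block;
    this.offset = offset;
    this.length = length;
  }

  @Override
  public InputStream newStream() {
    // A new stream over the existing array: restartable on retry,
    // and the data is never duplicated into a private buffer.
    return new ByteArrayInputStream(block, offset, length);
  }
}
```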
   
   There are new IOStatistics for specific HTTP error codes; these are collected 
even when all recovery is performed within the SDK.
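   
   A sketch of how the collected statistics can be inspected (the bucket URI is 
illustrative, and the exact counter names added by this patch are not 
reproduced here):
   
```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

import static org.apache.hadoop.fs.statistics.IOStatisticsLogging.ioStatisticsToPrettyString;
import static org.apache.hadoop.fs.statistics.IOStatisticsSupport.retrieveIOStatistics;

public class DumpS3AIOStatistics {
  public static void main(String[] args) throws Exception {
    try (FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"),
        new Configuration())) {
      // The S3A filesystem is an IOStatisticsSource; the pretty-printed output
      // includes the per-HTTP-status-code counters once they have been collected.
      System.out.println(ioStatisticsToPrettyString(retrieveIOStatistics(fs)));
    }
  }
}
```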
     
   S3ABlockOutputStream has major changes, including handling of 
Thread.interrupt() on the main thread, which now triggers and briefly awaits 
cancellation of any ongoing uploads.
   
   If the writing thread is interrupted in close(), this is mapped to an 
InterruptedIOException. Applications such as Hive and Spark must catch it after 
cancelling a worker thread.
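   
   A sketch of the pattern such an application worker might use (the class and 
method names here are hypothetical):
   
```java
import java.io.InterruptedIOException;

import org.apache.hadoop.fs.FSDataOutputStream;

/** Illustrative only: how a cancellable worker might surface the interrupt. */
public class CancellableBlockWriter {

  public void writeAndClose(FSDataOutputStream out, byte[] block) throws Exception {
    try {
      out.write(block);
      out.close();
    } catch (InterruptedIOException e) {
      // Raised when the writing thread was interrupted; the output stream has
      // already cancelled its ongoing multipart uploads by this point.
      Thread.currentThread().interrupt();  // preserve the interrupt status
      throw e;                             // or map to the application's own cancellation type
    }
  }
}
```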
   
   Contributed by Steve Loughran
   
   
   ### How was this patch tested?
   
   in progress
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> S3A: Unable to recover from failure of multipart block upload attempt "Status 
> Code: 400; Error Code: RequestTimeout"
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-19221
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19221
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>
> If a multipart PUT request fails for some reason (e.g. a network error), then 
> all subsequent retry attempts fail with a 400 response and error code 
> RequestTimeout.
> {code}
> Your socket connection to the server was not read from or written to within 
> the timeout period. Idle connections will be closed. (Service: Amazon S3; 
> Status Code: 400; Error Code: RequestTimeout; Request ID:; S3 Extended 
> Request ID:
> {code}
> The list of suppressed exceptions contains the root cause (the initial 
> failure was a 500); all retries failed to upload properly from the source 
> input stream {{RequestBody.fromInputStream(fileStream, size)}}.
> Hypothesis: the mark/reset support doesn't work for input streams. On the v1 
> SDK we would build a multipart block upload request passing in (file, offset, 
> length); the way we are now doing this doesn't recover.
> Probably fixable by providing our own {{ContentStreamProvider}} 
> implementations for:
> # file + offset + length
> # bytebuffer
> # byte array
> The SDK does have explicit support for the memory ones, but those copy the 
> data blocks first. We don't want that, as it would double the memory 
> requirements of active blocks.


