[ 
https://issues.apache.org/jira/browse/HADOOP-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032525#comment-18032525
 ] 

Steve Loughran commented on HADOOP-19734:
-----------------------------------------

+ any tracking in the block output stream should record when the POST to 
initiate the MPU was issued. That way, if an error still surfaces but the 
output stream has been open for three days, we have a likely cause to report: 
"stream open too long"
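
A rough sketch of that tracking, not the S3A implementation: the class 
UploadAgeTracker, its methods and the age threshold are all hypothetical names 
invented for illustration. It only shows the idea of remembering the time of 
the initiate POST and folding the upload's age into the error text.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: remembers when the MPU initiate POST was issued. */
final class UploadAgeTracker {
  private final Instant initiatedAt;
  private final Duration maxExpectedAge;

  /**
   * @param initiatedAt time the POST to initiate the MPU was issued
   * @param maxExpectedAge assumed threshold beyond which the stream counts as "open too long"
   */
  UploadAgeTracker(Instant initiatedAt, Duration maxExpectedAge) {
    this.initiatedAt = initiatedAt;
    this.maxExpectedAge = maxExpectedAge;
  }

  /** Time elapsed between the initiate POST and the given instant. */
  Duration age(Instant now) {
    return Duration.between(initiatedAt, now);
  }

  /** True if the upload is older than the assumed threshold. */
  boolean openTooLong(Instant now) {
    return age(now).compareTo(maxExpectedAge) > 0;
  }

  /** Builds an error message that includes the upload age and, if relevant, the likely cause. */
  String describeFailure(String error, Instant now) {
    String base = error + " (upload initiated " + age(now).toHours() + "h ago)";
    return openTooLong(now) ? base + ": stream open too long" : base;
  }
}
```

With something like this, a completion failure on a stream that has been open 
for three days would carry its age in the message, so it is not mistaken for 
the transient case.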

> S3A: retry on MPU completion failure "One or more of the specified parts 
> could not be found"
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-19734
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19734
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.2
>         Environment: aws s3 london
>            Reporter: Steve Loughran
>            Priority: Minor
>
> Experienced transient failure in test run of 
> https://github.com/apache/hadoop/pull/7882 : all MPU complete posts failed 
> because the request or parts were not found...the tests started succeeding 
> 60-90s later *and* a "hadoop s3guard uploads" call listed the outstanding 
> uploads of the failing tests.
> Hypothesis: a transient failure meant the server receiving the POST calls to 
> complete the uploads was mistakenly reporting no upload IDs.
> Outcome: all active write operations failed, without any retry attempts. This 
> can lose data and fail jobs, even though the store may recover.
> Proposed: multipart uploads, especially those from the block output stream, 
> should retry on this error and treat it as a connectivity issue. 
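
The proposal in the issue description could be sketched roughly as below. This 
is an illustration under assumptions, not the S3A retry code: the class 
MpuCompleteRetry, its methods, and the matching on message text are all 
hypothetical, and a real fix would presumably go through the existing S3A 
retry policy rather than a hand-rolled loop.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: retry an MPU completion call on the "parts not found" error. */
final class MpuCompleteRetry {
  /** Error text taken from the issue title. */
  static final String PARTS_NOT_FOUND =
      "One or more of the specified parts could not be found";

  private MpuCompleteRetry() {
  }

  /** Assumption: the failure is recognised by its message text. */
  static boolean isRetryable(Exception e) {
    String message = e.getMessage();
    return message != null && message.contains(PARTS_NOT_FOUND);
  }

  /**
   * Invoke the completion call, retrying only on the "parts not found" error.
   * Any other failure, or the final failed attempt, is rethrown unchanged.
   */
  static <T> T completeWithRetry(Callable<T> complete, int maxAttempts, long sleepMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return complete.call();
      } catch (Exception e) {
        if (!isRetryable(e) || attempt == maxAttempts) {
          throw e;
        }
        // Fixed pause between attempts; a real policy would likely back off.
        Thread.sleep(sleepMillis);
      }
    }
    throw new IllegalStateException("unreachable");
  }
}
```

Since the failing tests only started succeeding 60-90s later, the attempts and 
sleep interval would need to span at least that long for a retry like this to 
have helped in the reported case.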


