[ 
https://issues.apache.org/jira/browse/HADOOP-19734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032522#comment-18032522
 ] 

Steve Loughran commented on HADOOP-19734:
-----------------------------------------

And stack on a write failure. 
{code}
[ERROR] 
org.apache.hadoop.fs.s3a.scale.ITestS3AHugeFilesArrayBlocks.test_010_CreateHugeFile
 -- Time elapsed: 2.870 s <<< ERROR!
org.apache.hadoop.fs.s3a.AWSBadRequestException: Completing multipart upload on 
job-00/test/tests3ascale/array/src/hugefile: 
software.amazon.awssdk.services.s3.model.S3Exception: One or more of the 
specified parts could not be found.  The part may not have been uploaded, or 
the specified entity tag may not match the part's entity tag. (Service: S3, 
Status Code: 400, Request ID: 1NNBCSX4NCDN7G9X, Extended Request ID: 
8vMmeyt1GfjGrf3UL9AN8vlwWSn9860f1gdeIBC3drmcjeQwC6wOPinMD8MSO6ggGw9ywwdcXroGTdVSFLYq0S0VdM/5bYfanDXJ43Eb4QU=)
 (SDK Attempt Count: 1):InvalidPart: One or more of the specified parts could 
not be found.  The part may not have been uploaded, or the specified entity tag 
may not match the part's entity tag. (Service: S3, Status Code: 400, Request 
ID: 1NNBCSX4NCDN7G9X, Extended Request ID: 
8vMmeyt1GfjGrf3UL9AN8vlwWSn9860f1gdeIBC3drmcjeQwC6wOPinMD8MSO6ggGw9ywwdcXroGTdVSFLYq0S0VdM/5bYfanDXJ43Eb4QU=)
 (SDK Attempt Count: 1)
        at 
org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:265)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:124)
        at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
        at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.finalizeMultipartUpload(WriteOperationHelper.java:318)
        at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.completeMPUwithRetries(WriteOperationHelper.java:370)
        at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.lambda$complete$3(S3ABlockOutputStream.java:1227)
        at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:493)
        at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
        at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.complete(S3ABlockOutputStream.java:1225)
        at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.access$1500(S3ABlockOutputStream.java:876)
        at 
org.apache.hadoop.fs.s3a.S3ABlockOutputStream.close(S3ABlockOutputStream.java:545)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
        at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at 
org.apache.hadoop.fs.s3a.scale.AbstractSTestS3AHugeFiles.test_010_CreateHugeFile(AbstractSTestS3AHugeFiles.java:276)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
        at java.util.ArrayList.forEach(ArrayList.java:1259)
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: One or more of 
the specified parts could not be found.  The part may not have been uploaded, 
or the specified entity tag may not match the part's entity tag. (Service: S3, 
Status Code: 400, Request ID: 1NNBCSX4NCDN7G9X, Extended Request ID: 
8vMmeyt1GfjGrf3UL9AN8vlwWSn9860f1gdeIBC3drmcjeQwC6wOPinMD8MSO6ggGw9ywwdcXroGTdVSFLYq0S0VdM/5bYfanDXJ43Eb4QU=)
 (SDK Attempt Count: 1)
        at 
software.amazon.awssdk.services.s3.model.S3Exception$BuilderImpl.build(S3Exception.java:113)
        at 
software.amazon.awssdk.services.s3.model.S3Exception$BuilderImpl.build(S3Exception.java:61)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.retryPolicyDisallowedRetryException(RetryableStageHelper.java:168)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:73)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
        at 
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at 
software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:53)
        at 
software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:35)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:82)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:62)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:43)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
        at 
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at 
software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
        at 
software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
        at 
software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:210)
        at 
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
        at 
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
        at 
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
        at 
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
        at 
software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
        at 
software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
        at 
software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
        at 
software.amazon.awssdk.services.s3.DefaultS3Client.completeMultipartUpload(DefaultS3Client.java:801)
        at 
software.amazon.awssdk.services.s3.DelegatingS3Client.lambda$completeMultipartUpload$1(DelegatingS3Client.java:611)
        at 
software.amazon.awssdk.services.s3.internal.crossregion.S3CrossRegionSyncClient.invokeOperation(S3CrossRegionSyncClient.java:67)
        at 
software.amazon.awssdk.services.s3.DelegatingS3Client.completeMultipartUpload(DelegatingS3Client.java:611)
        at 
org.apache.hadoop.fs.s3a.impl.S3AStoreImpl.completeMultipartUpload(S3AStoreImpl.java:906)
        at 
org.apache.hadoop.fs.s3a.S3AFileSystem$WriteOperationHelperCallbacksImpl.completeMultipartUpload(S3AFileSystem.java:1953)
        at 
org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$finalizeMultipartUpload$1(WriteOperationHelper.java:324)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
        ... 17 more

{code}

we'd have to map 400 + the error text to a "MultipartUploadCompleteFailed" 
exception and add a policy for it, leaving other 400s as unrecoverable.

> S3A: retry on MPU completion failure "One or more of the specified parts 
> could not be found"
> --------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-19734
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19734
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.2
>         Environment: aws s3 london
>            Reporter: Steve Loughran
>            Priority: Minor
>
> Experienced transient failure in test run of 
> https://github.com/apache/hadoop/pull/7882 : all MPU complete posts failed 
> because the request or parts were not found...the tests started succeeding 
> 60-90s later *and* a "hadoop s3guards uploads" call listed the outstanding 
> uploads of the failing tests.
> Hypothesis: a transient failure meant the server receiving the POST calls to 
> complete the uploads was mistakenly reporting no upload IDs.
> Outcome: all active write operations failed, without any retry attempts. This 
> can lose data and fail jobs, even though the store may recover.
> Proposed. The multipart uploads, especially block output stream, retry on 
> this error; treat it as a connectivity issue. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to