[
https://issues.apache.org/jira/browse/HADOOP-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779784#comment-17779784
]
ASF GitHub Bot commented on HADOOP-18797:
-----------------------------------------
shameersss1 commented on PR #6122:
URL: https://github.com/apache/hadoop/pull/6122#issuecomment-1780575205
> looks good and I'm not going to review backports except for backporting
issues.
>
> which s3 region did you test against, and with what parameters?
How was this patch tested?
1. Ran S3A Unit Tests
2. Ran S3A Integration test in us-west-1 region
[INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.TestDirectoryCommitterScale [INFO]
Running org.apache.hadoop.fs.s3a.commit.staging.TestPaths [INFO] Running
org.apache.hadoop.fs.s3a.commit.TestMagicCommitPaths [INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.TestStagingCommitter [INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.TestStagingPartitionedFileListing
[INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.TestStagingDirectoryOutputCommitter
[INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.TestStagingPartitionedJobCommit [INFO]
Running
org.apache.hadoop.fs.s3a.commit.staging.TestStagingPartitionedTaskCommit [INFO]
Tests run: 28, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.473 s - in
org.apache.hadoop.fs.s3a.commit.TestMagicCommitPaths [INFO] Tests run: 14,
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.611 s - in
org.apache.hadoop.fs.s3a.commit.staging.TestPaths [INFO] Tests run: 8,
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 20.965 s - in
org.apache.hadoop.fs.s3a.commit.staging.TestStagingDirectoryOutputCommitter
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.474 s
- in org.apache.hadoop.fs.s3a.commit.staging.TestStagingPartitionedFileListing
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.627 s
- in org.apache.hadoop.fs.s3a.commit.staging.TestStagingPartitionedJobCommit
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 29.249 s
- in org.apache.hadoop.fs.s3a.commit.staging.TestStagingPartitionedTaskCommit
[INFO] Tests run: 63, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.11 s
- in org.apache.hadoop.fs.s3a.commit.staging.TestStagingCommitter [INFO] Tests
run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 84.193 s - in
org.apache.hadoop.fs.s3a.commit.staging.TestDirectoryCommitterScale [INFO]
Running
org.apache.hadoop.fs.s3a.commit.staging.integration.ITestStagingCommitProtocol
[INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.integration.ITestDirectoryCommitProtocol
[INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.integration.ITestPartitionedCommitProtocol
[INFO] Running
org.apache.hadoop.fs.s3a.commit.staging.integration.ITestStagingCommitProtocolFailure
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.901 s
- in
.hadoop.fs.s3a.commit.staging.integration.ITestStagingCommitProtocolFailure
[INFO] Running org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitProtocol
[INFO] Running
org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitProtocolFailure [INFO]
Running org.apache.hadoop.fs.s3a.commit.integration.ITestS3ACommitterMRJob
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.225 s
- in org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitProtocolFailure
[INFO] Running org.apache.hadoop.fs.s3a.commit.ITestS3ACommitterFactory [INFO]
Running org.apache.hadoop.fs.s3a.commit.ITestCommitOperations [INFO] Running
org.apache.hadoop.fs.s3a.commit.ITestCommitOperationCost [INFO] Tests run: 1,
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 7.05 s - in
org.apache.hadoop.fs.s3a.commit.ITestS3ACommitterFactory [INFO] Tests run: 4,
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 43.398 s - in
org.apache.hadoop.fs.s3a.commit.ITestCommitOperationCost [INFO] Tests run: 18,
Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 143.308 s - in
org.apache.hadoop.fs.s3a.commit.ITestCommitOperations [INFO] Running
org.apache.hadoop.fs.s3a.auth.ITestAssumedRoleCommitOperations [INFO] Tests
run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 218.466 s - in
org.apache.hadoop.fs.s3a.commit.integration.ITestS3ACommitterMRJob [WARNING]
Tests run: 18, Failures: 0, Errors: 0, Skipped: 18, Time elapsed: 62.036 s - in
org.apache.hadoop.fs.s3a.auth.ITestAssumedRoleCommitOperations [WARNING] Tests
run: 24, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 429.367 s - in
.hadoop.fs.s3a.commit.staging.integration.ITestPartitionedCommitProtocol [INFO]
Tests run: 24, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 472.06 s - in
.hadoop.fs.s3a.commit.staging.integration.ITestStagingCommitProtocol [INFO]
Tests run: 25, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 498.078 s - in
.hadoop.fs.s3a.commit.staging.integration.ITestDirectoryCommitProtocol [INFO]
Tests run: 23, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 861.607 s - in
org.apache.hadoop.fs.s3a.commit.magic.ITestMagicCommitProtocol [INFO] Running
org.apache.hadoop.fs.s3a.commit.magic.ITestS3AHugeMagicCommits [WARNING] Tests
run: 10, Failures: 0, Errors: 0, Skipped: 10, Time elapsed: 29.141 s - in
org.apache.hadoop.fs.s3a.commit.magic.ITestS3AHugeMagicCommits [INFO] Running
org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A [WARNING] Tests
run: 14, Failures: 0, Errors: 0, Skipped: 14, Time elapsed: 47.485 s - in
org.apache.hadoop.fs.s3a.commit.terasort.ITestTerasortOnS3A
> Support Concurrent Writes With S3A Magic Committer
> --------------------------------------------------
>
> Key: HADOOP-18797
> URL: https://issues.apache.org/jira/browse/HADOOP-18797
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Reporter: Emanuel Velzi
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> There is a failure in the commit process when multiple jobs are writing to a
> s3 directory *concurrently* using {*}magic committers{*}.
> This issue is closely related HADOOP-17318.
> When multiple Spark jobs write to the same S3A directory, they upload files
> simultaneously using "__magic" as the base directory for staging. Inside this
> directory, there are multiple "/job-some-uuid" directories, each representing
> a concurrently running job.
> To fix some preoblems related to concunrrency a property was introduced in
> the previous fix: "spark.hadoop.fs.s3a.committer.abort.pending.uploads". When
> set to false, it ensures that during the cleanup stage, finalizing jobs do
> not abort pending uploads from other jobs. So we see in logs this line:
> {code:java}
> DEBUG [main] o.a.h.fs.s3a.commit.AbstractS3ACommitter (819): Not cleanup up
> pending uploads to s3a ...{code}
> (from
> [AbstractS3ACommitter.java#L952|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L952])
> However, in the next step, the {*}"__magic" directory is recursively
> deleted{*}:
> {code:java}
> INFO [main] o.a.h.fs.s3a.commit.magic.MagicS3GuardCommitter (98): Deleting
> magic directory s3a://my-bucket/my-table/__magic: duration 0:00.560s {code}
> (from [AbstractS3ACommitter.java#L1112
> |https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L1112]and
>
> [MagicS3GuardCommitter.java#L137)|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java#L137)]
> This deletion operation *affects the second job* that is still running
> because it loses pending uploads (i.e., ".pendingset" and ".pending" files).
> The consequences can range from an exception in the best case to a silent
> loss of data in the worst case. The latter occurs when Job_1 deletes files
> just before Job_2 executes "listPendingUploadsToCommit" to list ".pendingset"
> files in the job attempt directory previous to complete the uploads with POST
> requests.
> To resolve this issue, it's important {*}to ensure that only the prefix
> associated with the job currently finalizing is cleaned{*}.
> Here's a possible solution:
> {code:java}
> /**
> * Delete the magic directory.
> */
> public void cleanupStagingDirs() {
> final Path out = getOutputPath();
> //Path path = magicSubdir(getOutputPath());
> Path path = new Path(magicSubdir(out), formatJobDir(getUUID()));
> try(DurationInfo ignored = new DurationInfo(LOG, true,
> "Deleting magic directory %s", path)) {
> Invoker.ignoreIOExceptions(LOG, "cleanup magic directory",
> path.toString(),
> () -> deleteWithWarning(getDestFS(), path, true));
> }
> } {code}
>
> The side effect of this issue is that the "__magic" directory is never
> cleaned up. However, I believe this is a minor concern, even considering that
> other folders such as "_SUCCESS" also persist after jobs end.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]