[ https://issues.apache.org/jira/browse/HADOOP-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759527#comment-17759527 ]

Syed Shameerur Rahman edited comment on HADOOP-18797 at 8/28/23 3:57 PM:
-------------------------------------------------------------------------

This looks like a valid use case: when multiple jobs write to the same table 
but different partitions, the MPU metadata (pendingset) of slower-running jobs 
might be deleted by the jobs which complete first.

I can think of three approaches here:

Approach 1: Do job-level magic directory deletion, i.e. __magic/job_<jobId>/ 
(as mentioned by [~emanuelvelzi])

1. After the job is completed, delete the path __magic/job_<jobId>/

Pros
1. Concurrent writes will be supported

Cons
1. The __magic directory will be visible in the table path even though it 
won't be considered
2. The remains of a failed job stay forever unless deleted manually or via S3 
lifecycle policies

In order to solve 
[HADOOP-18568|https://issues.apache.org/jira/browse/HADOOP-18568], we can put 
this behind a config similar to fs.s3a.cleanup.magic.enabled.
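
A minimal sketch of what the cleanup step could look like (a standalone 
helper rather than the actual committer code; the class and method names here 
are purely illustrative):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch of Approach 1: delete only this job's subtree
 * under __magic, leaving the pendingset files of concurrently running
 * jobs untouched.
 */
public class JobLevelMagicCleanup {

  public static final String MAGIC = "__magic";

  /** Delete __magic/job_<jobId>/ for the job being finalized. */
  public static void cleanupJob(FileSystem fs, Path dest, String jobId)
      throws IOException {
    Path jobMagicDir = new Path(new Path(dest, MAGIC), "job_" + jobId);
    // Recursive delete of this job's staging metadata only; the shared
    // __magic/ parent (and other jobs' subdirectories) stay in place,
    // which is why __magic remains visible in the table path.
    fs.delete(jobMagicDir, true);
  }
}
{code}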


Approach 2: Optional delete of the __magic directory, as mentioned in 
[HADOOP-18568|https://issues.apache.org/jira/browse/HADOOP-18568]

1. Based on a config, we can choose whether or not to delete the magic directory

Pros
1. Solves both concurrent and scaling issues.

Cons
1. Say we have two Spark clusters, one with the config enabled to delete 
__magic and the other with it disabled. If they simultaneously write to the 
same table but different partitions, we will again hit the same concurrency 
issue as described in this Jira.
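
A sketch of the config-gated delete (the key name follows the 
fs.s3a.cleanup.magic.enabled suggestion above and is a proposal, not an 
existing Hadoop property):

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative sketch of Approach 2: opt-in/opt-out __magic cleanup. */
public class OptionalMagicCleanup {

  // Proposed key, modeled on the name suggested in this comment;
  // it does not exist in Hadoop today.
  public static final String CLEANUP_MAGIC_ENABLED =
      "fs.s3a.cleanup.magic.enabled";

  public static void maybeCleanupMagic(Configuration conf, FileSystem fs,
      Path dest) throws IOException {
    // When disabled, the whole __magic/ tree is left behind so that
    // concurrent jobs keep their pendingset files. Default true keeps
    // the current delete-on-commit behaviour.
    if (conf.getBoolean(CLEANUP_MAGIC_ENABLED, true)) {
      fs.delete(new Path(dest, "__magic"), true);
    }
  }
}
{code}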



Approach 3: Have a unique magic directory for each job, i.e. __magic_job<id> 
(similar to the staging directory in FileOutputCommitter)

1. Each job will write its pendingset to its own __magic_job<id>
2. The directory will be deleted after a successful commit of the job.

Pros
1. Concurrent writes will be supported
2. If all jobs are successful, no __magic_* directory will be visible

Cons
1. The remains of a failed job stay forever unless deleted manually or via S3 
lifecycle policies, which matches the behaviour of FileOutputCommitter
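
A sketch of the per-job path layout (again, only illustrative naming):

{code:java}
import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch of Approach 3: each job stages under its own
 * top-level __magic_job<id> directory, analogous to the per-job
 * staging directory of FileOutputCommitter.
 */
public class PerJobMagicPath {

  /** Build __magic_job<id> under the destination table path. */
  public static Path perJobMagicDir(Path dest, String jobId) {
    return new Path(dest, "__magic_job" + jobId);
  }

  // After a successful job commit the whole directory is deleted:
  //   fs.delete(perJobMagicDir(dest, jobId), true);
  // A failed job leaves __magic_job<id> behind, matching the con
  // noted above.
}
{code}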





> S3A committer fix lost data on concurrent jobs
> ----------------------------------------------
>
>                 Key: HADOOP-18797
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18797
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 3.3.6
>            Reporter: Emanuel Velzi
>            Priority: Major
>
> There is a failure in the commit process when multiple jobs are writing to an 
> S3 directory *concurrently* using {*}magic committers{*}.
> This issue is closely related to HADOOP-17318.
> When multiple Spark jobs write to the same S3A directory, they upload files 
> simultaneously using "__magic" as the base directory for staging. Inside this 
> directory, there are multiple "/job-some-uuid" directories, each representing 
> a concurrently running job.
> To fix some problems related to concurrency, a property was introduced in 
> the previous fix: "spark.hadoop.fs.s3a.committer.abort.pending.uploads". When 
> set to false, it ensures that during the cleanup stage, finalizing jobs do 
> not abort pending uploads from other jobs. So we see this line in the logs: 
> {code:java}
> DEBUG [main] o.a.h.fs.s3a.commit.AbstractS3ACommitter (819): Not cleanup up 
> pending uploads to s3a ...{code}
> (from 
> [AbstractS3ACommitter.java#L952|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L952])
> However, in the next step, the {*}"__magic" directory is recursively 
> deleted{*}:
> {code:java}
> INFO  [main] o.a.h.fs.s3a.commit.magic.MagicS3GuardCommitter (98): Deleting 
> magic directory s3a://my-bucket/my-table/__magic: duration 0:00.560s {code}
> (from 
> [AbstractS3ACommitter.java#L1112|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L1112] 
> and 
> [MagicS3GuardCommitter.java#L137|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java#L137])
> This deletion operation *affects the second job* that is still running 
> because it loses pending uploads (i.e., ".pendingset" and ".pending" files).
> The consequences can range from an exception in the best case to silent 
> loss of data in the worst case. The latter occurs when Job_1 deletes files 
> just before Job_2 executes "listPendingUploadsToCommit" to list ".pendingset" 
> files in the job attempt directory, before completing the uploads with POST 
> requests.
> To resolve this issue, it's important {*}to ensure that only the prefix 
> associated with the job currently being finalized is cleaned{*}.
> Here's a possible solution:
> {code:java}
> /**
>  * Delete the magic directory.
>  */
> public void cleanupStagingDirs() {
>   final Path out = getOutputPath();
>   // Previously: Path path = magicSubdir(getOutputPath());
>   Path path = new Path(magicSubdir(out), formatJobDir(getUUID()));
>   try (DurationInfo ignored = new DurationInfo(LOG, true,
>       "Deleting magic directory %s", path)) {
>     Invoker.ignoreIOExceptions(LOG, "cleanup magic directory",
>         path.toString(),
>         () -> deleteWithWarning(getDestFS(), path, true));
>   }
> } {code}
>  
> The side effect of this fix is that the "__magic" directory itself is never 
> cleaned up. However, I believe this is a minor concern, especially 
> considering that other artifacts such as "_SUCCESS" also persist after jobs 
> end.


