Emanuel Velzi created HADOOP-18797:
--------------------------------------

             Summary: S3A committer fix lost data on concurrent jobs
                 Key: HADOOP-18797
                 URL: https://issues.apache.org/jira/browse/HADOOP-18797
             Project: Hadoop Common
          Issue Type: Bug
          Components: fs/s3
            Reporter: Emanuel Velzi


There is a failure in the commit process when multiple jobs write to an S3
directory *concurrently* using {*}magic committers{*}.

This issue is closely related to HADOOP-17318.

When multiple Spark jobs write to the same S3A directory, they upload files 
simultaneously using "__magic" as the base directory for staging. Inside this 
directory, there are multiple "/job-some-uuid" directories, each representing a 
concurrently running job.
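
For illustration, the staging layout in this scenario looks roughly like this
(the bucket name and job UUIDs are hypothetical):
{noformat}
s3a://my-bucket/my-table/__magic/
    job-1111-aaaa-.../   <- staging data of Job_1 (".pending"/".pendingset" files)
    job-2222-bbbb-.../   <- staging data of Job_2, still running
{noformat}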

To fix some problems related to concurrency, a property was introduced by the
previous fix: "spark.hadoop.fs.s3a.committer.abort.pending.uploads". When set
to false, it ensures that, during the cleanup stage, finalizing jobs do not
abort pending uploads belonging to other jobs. So we see this line in the logs:
{code:java}
DEBUG [main] o.a.h.fs.s3a.commit.AbstractS3ACommitter (819): Not cleanup up 
pending uploads to s3a ...{code}
(from 
[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L952])
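
For context, here is a minimal sketch of how that property can be set from a
Spark job (the SparkConf/SparkSession usage is illustrative; only the property
name comes from the fix above):
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Sketch: keep other jobs' pending multipart uploads alive during cleanup.
SparkConf conf = new SparkConf()
    .set("spark.hadoop.fs.s3a.committer.abort.pending.uploads", "false");
SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
{code}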

 

However, in the next step, the {*}"__magic" directory is recursively deleted{*}:
{code:java}
INFO  [main] o.a.h.fs.s3a.commit.magic.MagicS3GuardCommitter (98): Deleting 
magic directory s3a://my-bucket/my-table/__magic: duration 0:00.560s {code}
(from
[https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java#L137])

This deletion operation *affects the second job* that is still running, because
that job loses its pending uploads (i.e., the ".pendingset" and ".pending"
files).

The consequences range from an exception in the best case to silent data loss
in the worst case. The latter occurs when Job_1 deletes the files just before
Job_2 executes "listPendingUploadsToCommit" to list the ".pendingset" files in
its job attempt directory, prior to completing the uploads with POST requests.
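
A hypothetical interleaving that produces the silent data loss:
{noformat}
Job_1: commits; with abort.pending.uploads=false it leaves Job_2's uploads alone
Job_1: cleanupStagingDirs() recursively deletes s3a://my-bucket/my-table/__magic/
Job_2: listPendingUploadsToCommit() finds no ".pendingset" files to commit
Job_2: "succeeds" without completing its multipart uploads -> data silently lost
{noformat}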

To resolve this issue, it's important {*}to ensure that only the prefix
associated with the job that is currently finalizing is cleaned up{*}.

Here's a possible solution:
{code:java}
/**
 * Delete the magic directory.
 */
public void cleanupStagingDirs() {
  // Previously the whole shared magic directory was deleted:
  // Path path = magicSubdir(getOutputPath());
  // Instead, delete only this job's own subdirectory, so the pending
  // uploads of other concurrently running jobs stay intact.
  Path path = new Path(magicSubdir(getOutputPath()),
      formatAppAttemptDir(getUUID()));
  try (DurationInfo ignored = new DurationInfo(LOG, true,
      "Deleting magic directory %s", path)) {
    Invoker.ignoreIOExceptions(LOG, "cleanup magic directory", path.toString(),
        () -> deleteWithWarning(getDestFS(), path, true));
  }
} {code}
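
With this change, a finalizing job would delete only its own subdirectory,
i.e. something like s3a://my-bucket/my-table/__magic/job-<uuid> (exact naming
depends on formatAppAttemptDir), leaving the pending files of other running
jobs untouched.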
 

The side effect of this fix is that the "__magic" directory itself is never
cleaned up. However, I believe this is a minor concern, especially considering
that other artifacts such as the "_SUCCESS" marker also persist after jobs end.


