[jira] [Updated] (HADOOP-18797) Support Concurrent Writes With S3A Magic Committer

Steve Loughran (Jira) Fri, 12 Jul 2024 05:02:04 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran updated HADOOP-18797:
------------------------------------
    Description: 
h2. warning: no guarantee of safe/consistent results due to non-atomic job commit

Important: There is no guarantee that concurrent jobs writing to the same table 
are safe. This is as true for the classic FileOutputCommitter as it is for the 
S3A Magic Committer.

Why not? Because neither of these committers (or any other which works on a 
file-by-file basis) has an atomic job commit operation. If two jobs commit at 
the same time, the results are *completely undefined*. And this may not be 
detected by the applications.

If you want safe, concurrent parallel writes, use a manifest-file format with 
applications able to handle commit conflicts.
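
To illustrate what "able to handle commit conflicts" means, here is a minimal in-memory sketch of an optimistic, atomic manifest commit. It is not taken from any committer or table format: the class ManifestTable and its methods are hypothetical, and an AtomicReference stands in for whatever atomic "swap the current manifest" primitive a real format provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a table whose visible state is a single manifest. */
final class ManifestTable {

  /** Immutable snapshot: a version number plus the list of data files. */
  static final class Manifest {
    final long version;
    final List<String> files;

    Manifest(long version, List<String> files) {
      this.version = version;
      this.files = List.copyOf(files);
    }
  }

  private final AtomicReference<Manifest> current =
      new AtomicReference<>(new Manifest(0, List.of()));

  Manifest read() {
    return current.get();
  }

  /**
   * Atomically publish newFiles on top of base.
   * Returns false if another job has committed since base was read.
   */
  boolean tryCommit(Manifest base, List<String> newFiles) {
    List<String> merged = new ArrayList<>(base.files);
    merged.addAll(newFiles);
    Manifest next = new Manifest(base.version + 1, merged);
    // compareAndSet only succeeds if the table still points at base.
    return current.compareAndSet(base, next);
  }

  /** Handle conflicts by re-reading the manifest and trying again. */
  void commitWithRetry(List<String> newFiles) {
    while (!tryCommit(read(), newFiles)) {
      // Conflict: another job committed first; loop and rebase.
    }
  }

  public static void main(String[] args) {
    ManifestTable table = new ManifestTable();
    Manifest seenByJob1 = table.read();
    Manifest seenByJob2 = table.read();

    // Job 1 commits first; job 2's commit is rejected instead of
    // silently overwriting or interleaving with job 1's output.
    System.out.println(table.tryCommit(seenByJob1, List.of("part-0001")));
    System.out.println(table.tryCommit(seenByJob2, List.of("part-0002")));

    // Job 2 handles the conflict by rebasing on the latest manifest.
    table.commitWithRetry(List.of("part-0002"));
    System.out.println(table.read().files);
  }
}
```

The point of the sketch is that the conflict is detected and reported to the caller; with a file-by-file committer there is no such single point where two job commits can be ordered.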

h2. problem

There is a failure in the commit process when multiple jobs are writing to an S3 
directory *concurrently* using {*}magic committers{*}.

This issue is closely related to HADOOP-17318.

When multiple Spark jobs write to the same S3A directory, they upload files 
simultaneously using "__magic" as the base directory for staging. Inside this 
directory, there are multiple "/job-some-uuid" directories, each representing a 
concurrently running job.

To fix some problems related to concurrency, a property was introduced in the 
previous fix: "spark.hadoop.fs.s3a.committer.abort.pending.uploads". When set 
to false, it ensures that during the cleanup stage, finalizing jobs do not 
abort pending uploads from other jobs. So we see this line in the logs: 
{code:java}
DEBUG [main] o.a.h.fs.s3a.commit.AbstractS3ACommitter (819): Not cleanup up 
pending uploads to s3a ...{code}
(from 
[AbstractS3ACommitter.java#L952|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L952])
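
A note on the property name: Spark copies any property prefixed with "spark.hadoop." into the Hadoop configuration with that prefix removed, so the committer itself reads "fs.s3a.committer.abort.pending.uploads". The sketch below only models that prefix handling with plain maps; the class SparkHadoopPrefix is hypothetical and the "spark.app.name" value is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of how spark.hadoop.* properties reach Hadoop. */
final class SparkHadoopPrefix {
  static final String PREFIX = "spark.hadoop.";

  /** Copy every spark.hadoop.* entry into a new map with the prefix removed. */
  static Map<String, String> toHadoopProperties(Map<String, String> sparkConf) {
    Map<String, String> hadoop = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : sparkConf.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        hadoop.put(e.getKey().substring(PREFIX.length()), e.getValue());
      }
    }
    return hadoop;
  }

  public static void main(String[] args) {
    Map<String, String> sparkConf = new LinkedHashMap<>();
    sparkConf.put("spark.hadoop.fs.s3a.committer.abort.pending.uploads", "false");
    sparkConf.put("spark.app.name", "concurrent-writer");

    // Only the spark.hadoop.* entry is forwarded, without its prefix.
    System.out.println(toHadoopProperties(sparkConf));
  }
}
```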

However, in the next step, the {*}"__magic" directory is recursively deleted{*}:
{code:java}
INFO  [main] o.a.h.fs.s3a.commit.magic.MagicS3GuardCommitter (98): Deleting 
magic directory s3a://my-bucket/my-table/__magic: duration 0:00.560s {code}
(from 
[AbstractS3ACommitter.java#L1112|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L1112] 
and 
[MagicS3GuardCommitter.java#L137|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java#L137])

This deletion operation *affects the second job* that is still running because 
it loses pending uploads (i.e., ".pendingset" and ".pending" files).

The consequences can range from an exception in the best case to a silent loss 
of data in the worst case. The latter occurs when Job_1 deletes files just 
before Job_2 executes "listPendingUploadsToCommit" to list ".pendingset" files 
in the job attempt directory, before completing the uploads with POST 
requests.
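
The silent-loss case can be reproduced with a toy model. The sketch below is an assumption-laden simplification: a sorted set of key names stands in for the bucket, the directory layout is flattened to "__magic/job-N/", and the class MagicCleanupRace and its methods are hypothetical rather than the committer's own code.

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical in-memory model of the object keys under one table path. */
final class MagicCleanupRace {
  private final TreeSet<String> keys = new TreeSet<>();

  void put(String key) {
    keys.add(key);
  }

  /** Model of a recursive delete: drop every key under the prefix. */
  void deletePrefix(String prefix) {
    keys.removeIf(k -> k.startsWith(prefix));
  }

  /** Model of the listing step: the .pendingset files under one job directory. */
  List<String> listPendingSets(String jobDir) {
    return keys.stream()
        .filter(k -> k.startsWith(jobDir) && k.endsWith(".pendingset"))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    MagicCleanupRace store = new MagicCleanupRace();
    store.put("my-table/__magic/job-1/task_0.pendingset");
    store.put("my-table/__magic/job-2/task_0.pendingset");
    store.put("my-table/__magic/job-2/task_1.pendingset");

    // Job 1 finishes first and deletes the whole shared magic directory.
    store.deletePrefix("my-table/__magic/");

    // Job 2 now lists its pending sets: the list is empty and nothing fails,
    // so job 2 would go on to "commit" zero uploads.
    List<String> toCommit = store.listPendingSets("my-table/__magic/job-2/");
    System.out.println("job-2 uploads to commit: " + toCommit.size());
  }
}
```

Because an empty listing is a valid result, nothing in this path distinguishes "the job wrote no data" from "another job deleted my pending sets".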

To resolve this issue, it's important {*}to ensure that only the prefix 
associated with the job currently finalizing is cleaned{*}.

Here's a possible solution:
{code:java}
/**
 * Delete only this job's subdirectory of the magic directory,
 * leaving the staging data of other concurrent jobs in place.
 */
public void cleanupStagingDirs() {
  final Path out = getOutputPath();
  // Previously the whole magic directory was deleted:
  // Path path = magicSubdir(getOutputPath());
  Path path = new Path(magicSubdir(out), formatJobDir(getUUID()));

  try (DurationInfo ignored = new DurationInfo(LOG, true,
      "Deleting magic directory %s", path)) {
    Invoker.ignoreIOExceptions(LOG, "cleanup magic directory", path.toString(),
        () -> deleteWithWarning(getDestFS(), path, true));
  }
}
{code}
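
The effect of scoping the delete to the job prefix can be checked outside Hadoop. The following is a rough standalone sketch that uses java.nio.file on a local temporary directory as a stand-in for S3A; the class ScopedMagicCleanup, the "job-" + uuid naming and the file names are assumptions made for the example, not the committer's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem model of job-scoped magic directory cleanup. */
final class ScopedMagicCleanup {

  /** Recursively delete dir and everything below it; no-op if it is missing. */
  static void deleteRecursively(Path dir) {
    if (!Files.exists(dir)) {
      return;
    }
    try {
      List<Path> paths;
      try (Stream<Path> walk = Files.walk(dir)) {
        // Reverse order so children are deleted before their parents.
        paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      }
      for (Path p : paths) {
        Files.delete(p);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Clean up only the finishing job's subdirectory of the magic directory. */
  static void cleanupJob(Path magicDir, String jobUuid) {
    deleteRecursively(magicDir.resolve("job-" + jobUuid));
  }

  /** Build two job directories, clean up job 1, return the survivors' names. */
  static List<String> runExample() {
    try {
      Path root = Files.createTempDirectory("magic-demo");
      Path magic = root.resolve("__magic");
      for (String job : List.of("job-1", "job-2")) {
        Path dir = Files.createDirectories(magic.resolve(job));
        Files.createFile(dir.resolve("task_0.pendingset"));
      }

      cleanupJob(magic, "1");

      List<String> remaining;
      try (Stream<Path> children = Files.list(magic)) {
        remaining = children.map(p -> p.getFileName().toString())
            .sorted().collect(Collectors.toList());
      }
      deleteRecursively(root); // remove the temporary tree
      return remaining;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Only job 2's directory should remain under __magic.
    System.out.println(runExample());
  }
}
```

Note that in this model the "__magic" directory itself survives the cleanup, which matches the side effect described below.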
 

The side effect of this fix is that the "__magic" directory itself is never 
cleaned up. However, I believe this is a minor concern, given that other 
entries such as "_SUCCESS" also persist after jobs end.


> Support Concurrent Writes With S3A Magic Committer
> --------------------------------------------------
>
>                 Key: HADOOP-18797
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18797
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Emanuel Velzi
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.9
>


