[ https://issues.apache.org/jira/browse/HADOOP-19047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814189#comment-17814189 ]

ASF GitHub Bot commented on HADOOP-19047:
-----------------------------------------

shameersss1 commented on code in PR #6468:
URL: https://github.com/apache/hadoop/pull/6468#discussion_r1477749474


##########
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java:
##########
@@ -264,9 +326,14 @@ public void abortTask(TaskAttemptContext context) throws IOException {
     try (DurationInfo d = new DurationInfo(LOG,
         "Abort task %s", context.getTaskAttemptID());
         CommitContext commitContext = initiateTaskOperation(context)) {
-      getCommitOperations().abortAllSinglePendingCommits(attemptPath,
-          commitContext,
-          true);
+      if (isTrackMagicCommitsInMemoryEnabled(context.getConfiguration())) {
+        List<SinglePendingCommit> pendingCommits = loadPendingCommitsFromMemory(context);
+        for (SinglePendingCommit singleCommit : pendingCommits) {
+          commitContext.abortSingleCommit(singleCommit);
+        }
+      } else {
+        getCommitOperations().abortAllSinglePendingCommits(attemptPath, commitContext, true);

Review Comment:
   AFAIK, Spark calls abortTask from the same process (the executor). When the 
job fails, the abortJob operation is called, which lists all the pending 
uploads and aborts them, as mentioned in the comment 
[here](https://github.com/apache/hadoop/pull/6468#issuecomment-1926348440).
   
   I am not sure why a different process would call abortTask; the driver 
process should ideally call abortJob if a job fails.





> Support InMemory Tracking Of S3A Magic Commits
> ----------------------------------------------
>
>                 Key: HADOOP-19047
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19047
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>
> The following are the operations that happen within a Task when it uses the 
> S3A Magic Committer.
> *During closing of the stream*
> 1. A 0-byte file with the same name as the original file is uploaded to S3 
> using a PUT operation. Refer 
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicCommitTracker.java#L152]
>  for more information. This is done so that a downstream application like 
> Spark can get the size of the file being written.
> 2. MultiPartUpload (MPU) metadata is uploaded to S3. Refer 
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicCommitTracker.java#L176]
>  for more information.
> *During TaskCommit*
> 1. All the MPU metadata that the task wrote to S3 (there will be 'x' 
> metadata files in S3 if a single task writes to 'x' files) is read and 
> rewritten to S3 as a single metadata file. Refer 
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java#L201]
>  for more information.
> Since these operations happen within the Task JVM, we could both optimize 
> and save cost by keeping this information in memory when task memory usage 
> is not a constraint. Hence the proposal is to introduce a new magic commit 
> tracker called "InMemoryMagicCommitTracker", which will:
> 1. Store the metadata of the MPU in memory until the task is committed.
> 2. Store the size of each file, which a downstream application can use to 
> get the file size before it is committed/visible in the output path.
> This optimization saves 2 PUT calls, 1 LIST call, and 1 GET call to S3, 
> given that a task writes only 1 file.
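
The in-memory tracking proposed above can be sketched roughly as follows. This is an illustrative stand-in, not the actual InMemoryMagicCommitTracker from the PR; the class, method, and field names here are all hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: record multipart-upload (MPU) metadata per task
// attempt in memory instead of writing a .pending metadata file to S3.
public class InMemoryTrackerSketch {

  // Minimal stand-in for the metadata a tracker records per pending commit.
  static final class PendingCommit {
    final String destKey;
    final long length;
    PendingCommit(String destKey, long length) {
      this.destKey = destKey;
      this.length = length;
    }
  }

  // Per-task-attempt map holding every pending commit the task wrote.
  // Keeping this in memory avoids one PUT to S3 per written file.
  private static final Map<String, List<PendingCommit>> COMMITS_BY_TASK =
      new ConcurrentHashMap<>();

  // Called when an output stream closes: record the MPU metadata in memory.
  static void recordCommit(String taskAttemptId, String destKey, long length) {
    COMMITS_BY_TASK
        .computeIfAbsent(taskAttemptId, k -> new ArrayList<>())
        .add(new PendingCommit(destKey, length));
  }

  // Called at task commit (or abort): drain the in-memory commits, so no
  // LIST/GET of per-file metadata against S3 is needed.
  static List<PendingCommit> drainCommits(String taskAttemptId) {
    List<PendingCommit> commits = COMMITS_BY_TASK.remove(taskAttemptId);
    return commits == null ? new ArrayList<>() : commits;
  }

  public static void main(String[] args) {
    recordCommit("attempt_001", "out/part-0000", 1024L);
    recordCommit("attempt_001", "out/part-0001", 2048L);
    System.out.println("pending=" + drainCommits("attempt_001").size());
    System.out.println("after drain=" + drainCommits("attempt_001").size());
  }
}
```

Because the map lives only in the task JVM, this scheme works when commit and abort happen in the same process, which is exactly the assumption discussed in the review comment above.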



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
