[GitHub] [hadoop] ahmarsuhail commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

GitBox Thu, 23 Jun 2022 03:14:22 -0700


ahmarsuhail commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r902757105



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure.
 loss or corruption of generated data**
 
 
-To address these problems there is now explicit support in the `hadop-aws`
-module for committing work to Amazon S3 via the S3A filesystem client,
-*the S3A Committers*
+To address these problems there is now explicit support in the `hadoop-aws`
+module for committing work to Amazon S3 via the S3A filesystem client:
+*the S3A Committers*.
 
 
 For safe, as well as high-performance output of work to S3,
-we need use "a committer" explicitly written to work with S3, treating it as
-an object store with special features.
+we need to use "a committer" explicitly written to work with S3,
+treating it as an object store with special features.
 
 
-### Background : Hadoop's "Commit Protocol"
+### Background: Hadoop's "Commit Protocol"
 
 How exactly is work written to its final destination? That is accomplished by
 a "commit protocol" between the workers and the job manager.

Review Comment:
   Line 112 has a typo. Suggested wording: The job has "workers", which are processes which work _with_ the actual data and write the results.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure.
 loss or corruption of generated data**

Review Comment:
   On line 84, change `*.` to `*`.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -165,6 +165,7 @@ that the network has partitioned and that they must abort their work.
 That's "essentially" it. When working with HDFS and similar filesystems,

Review Comment:
   Two full stops on line 146; no full stop on lines 109 and 147.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -283,40 +281,37 @@ new data to an existing partitioned directory tree is a common operation.
 </property>
 ```
 
-**replace** : when the job is committed (and not before), delete files in
+The _Directory Committer_ uses the entire directory tree for conflict resolution.
+For this committer, the behavior of each conflict mode is shown below:
+

Review Comment:
   Is there a default mode? If so, can we say which one it is here?



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -530,18 +527,22 @@ performance.
 
 ### Enabling the committer
 
+Set the committer used by S3A's committer factory to `magic`:
+
 ```xml

Review Comment:
   Unrelated, but do you know why these configs aren't listed in https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#general-s3a-client-configuration?
   There are other properties (e.g. the delegation token config) that aren't there either. Wondering if it's useful to have them all in one place so it's easy to see everything that's available.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -696,10 +692,9 @@ The magic committer recognizes when files are created under paths with `__magic/
 and redirects the upload to a different location, adding the information needed to complete the upload

Review Comment:
   Add a full stop on line 675.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -650,7 +639,14 @@ Conflict management is left to the execution engine itself.
 </property>

Review Comment:
   Line 632 needs a full stop or question mark; it is a little confusing to read currently. Also, I don't see a config option for `fs.s3a.committer.staging.uuid` in the staging committer options list.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure.
 loss or corruption of generated data**
 
 
-To address these problems there is now explicit support in the `hadop-aws`
-module for committing work to Amazon S3 via the S3A filesystem client,
-*the S3A Committers*
+To address these problems there is now explicit support in the `hadoop-aws`
+module for committing work to Amazon S3 via the S3A filesystem client:
+*the S3A Committers*.
 
 
 For safe, as well as high-performance output of work to S3,
-we need use "a committer" explicitly written to work with S3, treating it as
-an object store with special features.
+we need to use "a committer" explicitly written to work with S3,
+treating it as an object store with special features.
 
 
-### Background : Hadoop's "Commit Protocol"
+### Background: Hadoop's "Commit Protocol"
 
 How exactly is work written to its final destination? That is accomplished by
 a "commit protocol" between the workers and the job manager.

Review Comment:
   Line 109 could possibly be made clearer: A "Job" is the entire query, which takes a given input and produces an output.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -283,40 +281,37 @@ new data to an existing partitioned directory tree is a common operation.
 </property>
 ```
 
-**replace** : when the job is committed (and not before), delete files in
+The _Directory Committer_ uses the entire directory tree for conflict resolution.
+For this committer, the behavior of each conflict mode is shown below:
+

Review Comment:
   Looks like it's `append`, based on the config table later; let's mention it here as well.
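
   For illustration, a minimal sketch of stating the mode explicitly instead of relying on the default. It assumes the staging committers read the mode from `fs.s3a.committer.staging.conflict-mode` and that `append` is the default, as the config table suggests; neither is confirmed in this thread:

   ```xml
   <!-- Sketch only: the property name and its default are assumptions taken
        from the staging committer options table, not checked against the code. -->
   <property>
     <name>fs.s3a.committer.staging.conflict-mode</name>
     <value>append</value>
   </property>
   ```

   Showing the default value in an example like this would let readers see at a glance what happens when the option is left unset.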



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -68,9 +68,9 @@ process across the cluster may rename a file or directory to the same path.
 If the rename fails for any reason, either the data is at the original location,

Review Comment:
   Missing full stop on line 54.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -180,8 +181,8 @@ and restarting the job.
 whose output is in the job attempt directory, *and only rerunning all uncommitted tasks*.
 
 
-This algorithm does not works safely or swiftly with AWS S3 storage because 
-tenames go from being fast, atomic operations to slow operations which can fail partway through.
+This algorithm does not work safely or swiftly with AWS S3 storage because
+renames go from being fast, atomic operations to slow operations which can fail partway through.
 
 This then is the problem which the S3A committers address:

Review Comment:
   Consider adding a full stop to line 188.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -474,7 +466,7 @@ files which do not contain relevant data.
 What the partitioned committer does is, where the tooling permits, allows callers
 to add data to an existing partitioned layout*.

Review Comment:
   Consider rephrasing: "If the tooling permits, the partitioned committer allows callers to add data to an existing partitioned layout." Also remove the `*` before the full stop.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -492,18 +484,19 @@ was written. With the policy of `append`, the new file would be added to
 the existing set of files.
 
 
-### Notes
+### Notes on using Staging Committers
 
 1. A deep partition tree can itself be a performance problem in S3 and the s3a client,
-or, more specifically. a problem with applications which use recursive directory tree
+or more specifically a problem with applications which use recursive directory tree
 walks to work with data.
 
 1. The outcome if you have more than one job trying simultaneously to write data
 to the same destination with any policy other than "append" is undefined.
 
 1. In the `append` operation, there is no check for conflict with file names.
-If, in the example above, the file `log-20170228.avro` already existed,
-it would be overridden. Set `fs.s3a.committer.staging.unique-filenames` to `true`
+If the file `log-20170228.avro` in the example above already existed, it would be overwritten.
+
+   Set `fs.s3a.committer.staging.unique-filenames` to `true`

Review Comment:
   The indentation isn't right here. I think it makes sense to keep this as part of point 3; consider reverting this change.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure.
 loss or corruption of generated data**

Review Comment:
   Add a full stop to line 88.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure.
 loss or corruption of generated data**
 
 
-To address these problems there is now explicit support in the `hadop-aws`
-module for committing work to Amazon S3 via the S3A filesystem client,
-*the S3A Committers*
+To address these problems there is now explicit support in the `hadoop-aws`
+module for committing work to Amazon S3 via the S3A filesystem client:
+*the S3A Committers*.
 
 
 For safe, as well as high-performance output of work to S3,
-we need use "a committer" explicitly written to work with S3, treating it as
-an object store with special features.
+we need to use "a committer" explicitly written to work with S3,
+treating it as an object store with special features.
 
 
-### Background : Hadoop's "Commit Protocol"
+### Background: Hadoop's "Commit Protocol"
 
 How exactly is work written to its final destination? That is accomplished by
 a "commit protocol" between the workers and the job manager.

Review Comment:
   Typo on line 129. I'm guessing it should be a separate bullet point.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -165,6 +165,7 @@ that the network has partitioned and that they must abort their work.
 That's "essentially" it. When working with HDFS and similar filesystems,

Review Comment:
   I think it makes sense to swap lines 147 and 145, as individual workers will communicate with the job manager first, and it will then decide whether to commit or abort.



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -530,18 +527,22 @@ performance.
 
 ### Enabling the committer
 
+Set the committer used by S3A's committer factory to `magic`:
+
 ```xml

Review Comment:
   OK, they're listed here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/bk_cloud-data-access/content/s3-config-parameters.html


