ahmarsuhail commented on code in PR #4478: URL: https://github.com/apache/hadoop/pull/4478#discussion_r902757105
########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure. loss or corruption of generated data** -To address these problems there is now explicit support in the `hadop-aws` -module for committing work to Amazon S3 via the S3A filesystem client, -*the S3A Committers* +To address these problems there is now explicit support in the `hadoop-aws` +module for committing work to Amazon S3 via the S3A filesystem client: +*the S3A Committers*. For safe, as well as high-performance output of work to S3, -we need use "a committer" explicitly written to work with S3, treating it as -an object store with special features. +we need to use "a committer" explicitly written to work with S3, +treating it as an object store with special features. -### Background : Hadoop's "Commit Protocol" +### Background: Hadoop's "Commit Protocol" How exactly is work written to its final destination? That is accomplished by a "commit protocol" between the workers and the job manager. Review Comment: line 112 has a typo. The job has "workers", which are processes which work _with_ the actual data and write the results. ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure. loss or corruption of generated data** Review Comment: On line 84, change *. to * ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -165,6 +165,7 @@ that the network has partitioned and that they must abort their work. That's "essentially" it. When working with HDFS and similar filesystems, Review Comment: two full stops on line 146, no full stop on 109, 147. ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -283,40 +281,37 @@ new data to an existing partitioned directory tree is a common operation. </property> ``` -**replace** : when the job is committed (and not before), delete files in +The _Directory Committer_ uses the entire directory tree for conflict resolution. +For this committer, the behavior of each conflict mode is shown below: + Review Comment: is there a default mode? if yes can we say which one that is here ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -530,18 +527,22 @@ performance. ### Enabling the committer +Set the committer used by S3A's committer factory to `magic`: + ```xml Review Comment: unrelated, but do you know why these configs aren't listed in https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#general-s3a-client-configuration? there are other properties (eg: delegation token config) that isn't there either. Wondering if it's useful to have them all in one place so it's easy to see everything that's available ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -696,10 +692,9 @@ The magic committer recognizes when files are created under paths with `__magic/ and redirects the upload to a different location, adding the information needed to complete the upload Review Comment: full stop on line 675 ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -650,7 +639,14 @@ Conflict management is left to the execution engine itself. </property> Review Comment: full stop/question mark on line 632. little confusing to read currently. also i don't see a config option for ` fs.s3a.committer.staging.uuid` in staging committer options list ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure. loss or corruption of generated data** -To address these problems there is now explicit support in the `hadop-aws` -module for committing work to Amazon S3 via the S3A filesystem client, -*the S3A Committers* +To address these problems there is now explicit support in the `hadoop-aws` +module for committing work to Amazon S3 via the S3A filesystem client: +*the S3A Committers*. For safe, as well as high-performance output of work to S3, -we need use "a committer" explicitly written to work with S3, treating it as -an object store with special features. +we need to use "a committer" explicitly written to work with S3, +treating it as an object store with special features. -### Background : Hadoop's "Commit Protocol" +### Background: Hadoop's "Commit Protocol" How exactly is work written to its final destination? That is accomplished by a "commit protocol" between the workers and the job manager. Review Comment: line 109 could possibly be made clearer: A "Job" is the entire query, which takes a given input and produces an output. ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -283,40 +281,37 @@ new data to an existing partitioned directory tree is a common operation. </property> ``` -**replace** : when the job is committed (and not before), delete files in +The _Directory Committer_ uses the entire directory tree for conflict resolution. +For this committer, the behavior of each conflict mode is shown below: + Review Comment: looks like it's append based on the config table later, let's mention it here as well. ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -68,9 +68,9 @@ process across the cluster may rename a file or directory to the same path. If the rename fails for any reason, either the data is at the original location, Review Comment: missing full stop on line 54. ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -180,8 +181,8 @@ and restarting the job. whose output is in the job attempt directory, *and only rerunning all uncommitted tasks*. -This algorithm does not works safely or swiftly with AWS S3 storage because -tenames go from being fast, atomic operations to slow operations which can fail partway through. +This algorithm does not work safely or swiftly with AWS S3 storage because +renames go from being fast, atomic operations to slow operations which can fail partway through. This then is the problem which the S3A committers address: Review Comment: consider adding a full stop to 188 ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -474,7 +466,7 @@ files which do not contain relevant data. What the partitioned committer does is, where the tooling permits, allows callers to add data to an existing partitioned layout*. Review Comment: Consider rephrasing. "If tool permits, the partitioned committer allows callers to add data to an existing partitioned layout." also remove * before the full stop ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -492,18 +484,19 @@ was written. With the policy of `append`, the new file would be added to the existing set of files. -### Notes +### Notes on using Staging Committers 1. A deep partition tree can itself be a performance problem in S3 and the s3a client, -or, more specifically. a problem with applications which use recursive directory tree +or more specifically a problem with applications which use recursive directory tree walks to work with data. 1. The outcome if you have more than one job trying simultaneously to write data to the same destination with any policy other than "append" is undefined. 1. In the `append` operation, there is no check for conflict with file names. -If, in the example above, the file `log-20170228.avro` already existed, -it would be overridden. Set `fs.s3a.committer.staging.unique-filenames` to `true` +If the file `log-20170228.avro` in the example above already existed, it would be overwritten. + + Set `fs.s3a.committer.staging.unique-filenames` to `true` Review Comment: Indentation isn't right here. I think it makes sense to have this as part of point 3, consider reverting this change ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure. loss or corruption of generated data** Review Comment: add a full stop to line 88 ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -88,17 +88,17 @@ proportional to the amount of data created. It still can't handle task failure. loss or corruption of generated data** -To address these problems there is now explicit support in the `hadop-aws` -module for committing work to Amazon S3 via the S3A filesystem client, -*the S3A Committers* +To address these problems there is now explicit support in the `hadoop-aws` +module for committing work to Amazon S3 via the S3A filesystem client: +*the S3A Committers*. For safe, as well as high-performance output of work to S3, -we need use "a committer" explicitly written to work with S3, treating it as -an object store with special features. +we need to use "a committer" explicitly written to work with S3, +treating it as an object store with special features. -### Background : Hadoop's "Commit Protocol" +### Background: Hadoop's "Commit Protocol" How exactly is work written to its final destination? That is accomplished by a "commit protocol" between the workers and the job manager. Review Comment: Typo on line 129. Guessing it should be a separate bullet point ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -165,6 +165,7 @@ that the network has partitioned and that they must abort their work. That's "essentially" it. When working with HDFS and similar filesystems, Review Comment: I think it makes sense to swap lines 147 and 145. As individual workers will communicate with the job manager first, and then it will make a decision to commit or abort ########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md: ########## @@ -530,18 +527,22 @@ performance. ### Enabling the committer +Set the committer used by S3A's committer factory to `magic`: + ```xml Review Comment: ok they're listed here: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/bk_cloud-data-access/content/s3-config-parameters.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org