[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-10-07 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r989873742


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -180,20 +180,20 @@ and restarting the job.
 whose output is in the job attempt directory, *and only rerunning all 
uncommitted tasks*.
 
 
-This algorithm does not works safely or swiftly with AWS S3 storage because 
-tenames go from being fast, atomic operations to slow operations which can 
fail partway through.
+This algorithm does not work safely or swiftly with AWS S3 storage because
+renames go from being fast, atomic operations to slow operations which can 
fail partway through.
 
 This then is the problem which the S3A committers address:
 
-*How to safely and reliably commit work to Amazon S3 or compatible object 
store*
+>*How to safely and reliably commit work to Amazon S3 or compatible object 
store.*

Review Comment:
   Fair point, it is not quoting anyone (or maybe Steve). I will let it join 
the previous sentence as I think that makes more sense.



##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -356,19 +351,20 @@ task commit.
 
 However, it has extra requirements of the filesystem
 
-1. [Obsolete] It requires a consistent object store.
+1. The object store must be consistent.
 1. The S3A client must be configured to recognize interactions
-with the magic directories and treat them specially.
+with the magic directories and treat them as a special case.
 
-Now that Amazon S3 is consistent, the magic committer is enabled by default.
+Now that [Amazon S3 is consistent](https://aws.amazon.com/s3/consistency/),
+the magic committer is enabled by default.

Review Comment:
   Yes, I believe that's correct.
   
   I believe what we're saying here is that S3A's "magic path rewriting" where 
it only stages the writes to `__magic/` directories is now enabled by default.
   
   I will update this to be clearer, something like:
   
   ```suggestion
   the magic directory path rewriting is enabled by default.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-10-07 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r989873550


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -530,18 +527,22 @@ performance.
 

Review Comment:
   will drop those!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-06-23 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r904911780


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -474,7 +466,7 @@ files which do not contain relevant data.
 What the partitioned committer does is, where the tooling permits, allows 
callers
 to add data to an existing partitioned layout*.

Review Comment:
   for this one, i'm not sure if the `*` is referring to something. I will 
remove for now. Happy to add it back



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-06-23 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r904905036


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -165,6 +165,7 @@ that the network has partitioned and that they must abort 
their work.
 That's "essentially" it. When working with HDFS and similar filesystems,

Review Comment:
   I agree - I will actually remove redundant "whether the output is to be 
*committed* or *aborted*" to hopefully make it clearer. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-06-23 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r904894123


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -283,40 +281,37 @@ new data to an existing partitioned directory tree is a 
common operation.
 
 ```
 
-**replace** : when the job is committed (and not before), delete files in
+The _Directory Committer_ uses the entire directory tree for conflict 
resolution.
+For this committer, the behavior of each conflict mode is shown below:
+

Review Comment:
   I will add it above the XML example.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-06-23 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r904893346


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -492,18 +484,19 @@ was written. With the policy of `append`, the new file 
would be added to
 the existing set of files.
 
 
-### Notes
+### Notes on using Staging Committers
 
 1. A deep partition tree can itself be a performance problem in S3 and the s3a 
client,
-or, more specifically. a problem with applications which use recursive 
directory tree
+or more specifically a problem with applications which use recursive directory 
tree
 walks to work with data.
 
 1. The outcome if you have more than one job trying simultaneously to write 
data
 to the same destination with any policy other than "append" is undefined.
 
 1. In the `append` operation, there is no check for conflict with file names.
-If, in the example above, the file `log-20170228.avro` already existed,
-it would be overridden. Set `fs.s3a.committer.staging.unique-filenames` to 
`true`
+If the file `log-20170228.avro` in the example above already existed, it would 
be overwritten.
+
+   Set `fs.s3a.committer.staging.unique-filenames` to `true`

Review Comment:
   Using the indentation like this I believe allows you to put the sentence on 
a new line but still part of the previous point.
   
   That being said, it is not obvious from the markdown and I cannot test the 
output HTML so I'll revert.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-06-23 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r904889810


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -650,7 +639,14 @@ Conflict management is left to the execution engine itself.
 

Review Comment:
   Good call out, I had a look and I think that option should actually be just 
`fs.s3a.committer.uuid` - no `staging`. I'll update it.
   
   The code in `AbstractS3ACommitter` checks for `fs.s3a.committer.uuid` and 
the Spark UUID before failing fast.
   
   
https://github.com/apache/hadoop/blob/e199da3fae1c82e87f88c8c50f6a96c6515e2edd/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java#L1341-L1360



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[GitHub] [hadoop] dannycjones commented on a diff in pull request #4478: HADOOP-18304. Improve user-facing S3A committers documentation

2022-06-23 Thread GitBox


dannycjones commented on code in PR #4478:
URL: https://github.com/apache/hadoop/pull/4478#discussion_r904883423


##
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##
@@ -530,18 +527,22 @@ performance.
 
 ### Enabling the committer
 
+Set the committer used by S3A's committer factory to `magic`:
+
 ```xml

Review Comment:
   It's a good question. I do worry we duplicate this in a lot of places, not 
sure what is the best option.
   
   I plan to leave it out of the scope of this patch though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org