steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit 
protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-474039157
 
 
   > I am not clear why it should throw this exception? What would happen if 
this code is opened.
   
   > I tried to decode the comment on the parameter, " dynamic partition 
overwrite is not supported, so that committers for stores which do not support 
rename will not get confused.", but I got bit confused. 
   
   oops :)
   
   The dynamicPartitionOverwrite option was something which came in with HADOOP-20326, but it assumes that the committer is always one which works with it (as the filesystem ones do). The PathOutputCommitter interface inserted into hadoop-mapreduce is a bit more relaxed than that, and things break.
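   
   To make the question about the exception concrete, here's a rough sketch of the kind of guard being discussed. This is illustrative only, not the exact code in this PR; the class name and the message are assumptions:
   
   ```scala
   // Sketch: a commit protocol bound to a generic PathOutputCommitter cannot
   // promise the delete-then-rename semantics which dynamic partition overwrite
   // relies on, so it refuses the option up front rather than risking bad output.
   import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

   class PathOutputCommitProtocolSketch(
       jobId: String,
       destination: String,
       dynamicPartitionOverwrite: Boolean = false)
     extends HadoopMapReduceCommitProtocol(jobId, destination, dynamicPartitionOverwrite) {

     if (dynamicPartitionOverwrite) {
       throw new UnsupportedOperationException(
         "Dynamic partition overwrite is not supported by PathOutputCommitters")
     }
   }
   ```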
   
   > when DynamicOverwrite is set, the committer (Directory/partition) will delete the existing contents (in REPLACE mode).
   
   Hadoop HDFS &c. should do that. For the [S3A committers](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md), things are subtly different in how conflicts are handled.
   
   In particular, Ryan Blue's *partitioned* committer has special handling for writing out to a directory tree, where the conflict policy (replace, append, fail) _is only applied to the final directories at the end of the partition tree_: [/PartitionedStagingCommitter.java#L122](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/PartitionedStagingCommitter.java#L122)
   
   This lets a job update a directory tree full of data in place, leaving untouched all directories which don't contain new data from the current job. It can instead be set to fail only if the final destinations already exist, or, if you configure it to delete files, it will purge all existing data in those destination paths, but nowhere else.
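   
   A minimal configuration sketch for selecting the partitioned committer and its conflict policy from Spark, via the `spark.hadoop.` pass-through (the application name is made up; the property values follow the S3A committer docs linked above):
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("partitioned-committer-demo")
     // use the S3A "partitioned" staging committer
     .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
     // conflict policy applied only to the final partition directories:
     // one of "fail", "append" or "replace"
     .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
     .getOrCreate()
   ```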
   
   Example: imagine a query which generates data in s3a://dest/year=2019/month=03/day=18
   
   If the destination path already has data in s3a://dest/year=2019/month=03/day=17, there'll be no conflict. If there is already something in day=18, then the new query can either add new files (remember, it defaults to giving each file a UUID in its name), delete the old ones before adding the new files, or fail.
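   
   As a hedged illustration of that flow (the `newData` rows and the `s3a://dest` bucket are made up), a Spark job updating just the day=18 partition would look something like:
   
   ```scala
   import spark.implicits._

   // made-up rows for the 2019-03-18 partition
   val newData = Seq((2019, 3, 18, "some value")).toDF("year", "month", "day", "data")

   newData.write
     .partitionBy("year", "month", "day")
     .mode("append")          // per-partition conflict handling is left to the committer
     .parquet("s3a://dest")   // only year=2019/month=03/day=18 sees the conflict policy
   ```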
   
   Is that clearer? It's designed for in-place updates of very large datasets without having to rename/move any output files afterwards.
   
