steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit 
protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-474039157
 
 
   > I am not clear why it should throw this exception? What would happen if 
this code is opened.
   
   > I tried to decode the comment on the parameter, " dynamic partition 
overwrite is not supported, so that committers for stores which do not support 
rename will not get confused.", but I got bit confused. 
   
   oops :)
   
   The dynamicPartitionOverwrite option was something which came in with HADOOP-20326, but it assumes that the committer is always one which works with it (as the filesystem ones do). The PathOutputCommitter interface inserted into hadoop-mapreduce is a bit more relaxed than that, and things break.
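   
   To make the question about the exception concrete, here's a rough sketch of the kind of guard being discussed. This is illustrative only, not the exact code in this PR; the class name and the message are assumptions:
   
   ```scala
   // Sketch: a commit protocol bound to a generic PathOutputCommitter cannot
   // promise the delete-then-rename semantics which dynamic partition overwrite
   // relies on, so it refuses the option up front rather than risking bad output.
   import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

   class PathOutputCommitProtocolSketch(
       jobId: String,
       destination: String,
       dynamicPartitionOverwrite: Boolean = false)
     extends HadoopMapReduceCommitProtocol(jobId, destination, dynamicPartitionOverwrite) {

     if (dynamicPartitionOverwrite) {
       throw new UnsupportedOperationException(
         "Dynamic partition overwrite is not supported by PathOutputCommitters")
     }
   }
   ```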
   
   > when DynamicOverwrite is set, the committer (Directory/partition) will delete the existing contents (in REPLACE mode).
   
   Hadoop HDFS &c. should do that. For the [S3A committers](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md), things are subtly different in how conflicts are handled.
   
   In particular, Ryan Blue's *partitioned* committer has special handling for writing out to a directory tree, where the conflict policy (replace, append, fail) _is only applied to the final directories at the end of the partition tree_: [/PartitionedStagingCommitter.java#L122](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/PartitionedStagingCommitter.java#L122)
   
   This lets a job update a directory tree full of data in place, leaving untouched all directories which don't contain new data from the current job. It can instead be set to fail only if the final destinations already exist, or, if you configure it to delete files, it will purge all existing data in those destination paths, but nowhere else.
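   
   A minimal configuration sketch for selecting the partitioned committer and its conflict policy from Spark, via the `spark.hadoop.` pass-through (the application name is made up; the property values follow the S3A committer docs linked above):
   
   ```scala
   import org.apache.spark.sql.SparkSession

   val spark = SparkSession.builder()
     .appName("partitioned-committer-demo")
     // use the S3A "partitioned" staging committer
     .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
     // conflict policy applied only to the final partition directories:
     // one of "fail", "append" or "replace"
     .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
     .getOrCreate()
   ```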
   
   Example: imagine a query which generates data in s3a://dest/year=2019/month=03/day=18
   
   If the destination path already has data in s3a://dest/year=2019/month=03/day=17, there'll be no conflict. If there is already something in day=18, then the new query can either add new files (remember, it defaults to giving each file a UUID in its name), delete the old ones before adding the new files, or fail.
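   
   As a hedged illustration of that flow (the `newData` rows and the `s3a://dest` bucket are made up), a Spark job updating just the day=18 partition would look something like:
   
   ```scala
   import spark.implicits._

   // made-up rows for the 2019-03-18 partition
   val newData = Seq((2019, 3, 18, "some value")).toDF("year", "month", "day", "data")

   newData.write
     .partitionBy("year", "month", "day")
     .mode("append")          // per-partition conflict handling is left to the committer
     .parquet("s3a://dest")   // only year=2019/month=03/day=18 sees the conflict policy
   ```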
   
   Is that clearer? It's designed for in-place updates of very large datasets without having to rename/move any output files afterwards.
   
