steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit 
protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-474874086
 
 
   I've not tested it at all with Dynamic Partitioned Overwrite, as that code 
contains assumptions about the destination being a filesystem which may not 
hold consistently with an object store. The new PathOutputCommitter in hadoop 
drops the simple view that output goes to a filesystem; all it assumes is that 
you can ask for a path in a filesystem to write the task data to. While that 
holds for the S3A committers, we've left the design very open so that others 
can plug in new committers for different filesystems (try it!) and override 
them with a job-specific one. 
   
   Have a look [in the hadoop 
source|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java]
   
   It's designed to let you do something really fun: switch the commit 
algorithms underneath applications (generally) without them noticing. It'd have 
been completely transparent if I'd tried to patch FileOutputCommitter itself, 
but spend time stepping through that code with a debugger and you'll understand 
why it's too complex and mission-critical to go near. I was scared. 
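
   As a rough illustration of how that switch happens, here is a small Python 
model of the factory lookup. It is a simplified sketch, not the Hadoop code: 
the function name `resolve_committer_factory` and the plain dict standing in 
for a Hadoop `Configuration` are assumptions made for this example. The 
configuration key names and class names are the ones I believe Hadoop 3.1 
uses; check them against the linked source before relying on them.

```python
from urllib.parse import urlparse

# Configuration keys as I understand them from PathOutputCommitterFactory.
FACTORY_CLASS_KEY = "mapreduce.outputcommitter.factory.class"
FACTORY_SCHEME_PATTERN = "mapreduce.outputcommitter.factory.scheme.{}"

# Fallback when nothing else is configured: the classic FileOutputCommitter.
DEFAULT_FACTORY = (
    "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory"
)
S3A_FACTORY = "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"


def resolve_committer_factory(output_path, conf):
    """Return the factory class name chosen for output_path.

    Hypothetical helper modelling the lookup order:
    1. a job-wide factory, if one is set;
    2. otherwise a factory registered for the destination's URI scheme;
    3. otherwise the default FileOutputCommitter factory.
    """
    explicit = conf.get(FACTORY_CLASS_KEY, "").strip()
    if explicit:
        return explicit
    if output_path:
        scheme = urlparse(output_path).scheme
        per_scheme = conf.get(FACTORY_SCHEME_PATTERN.format(scheme), "").strip()
        if per_scheme:
            return per_scheme
    return DEFAULT_FACTORY


# Cluster-wide setting: anything written to s3a:// gets the S3A factory.
cluster_conf = {FACTORY_SCHEME_PATTERN.format("s3a"): S3A_FACTORY}

# A single job overriding the per-filesystem choice with its own factory.
# "com.example.MyCommitterFactory" is a made-up class name.
job_conf = dict(cluster_conf)
job_conf[FACTORY_CLASS_KEY] = "com.example.MyCommitterFactory"

if __name__ == "__main__":
    print(resolve_committer_factory("s3a://bucket/table", cluster_conf))
    print(resolve_committer_factory("hdfs://namenode/table", cluster_conf))
    print(resolve_committer_factory("s3a://bucket/table", job_conf))
```

   The application only ever asks for a committer for its output path; which 
algorithm it gets is decided by configuration, which is why the swap can go 
unnoticed by the code doing the writing.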

