steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-474874086

I've not tested it at all with Dynamic Partitioned Overwrite, as that code contains assumptions about the destination being a filesystem, and those assumptions may not hold consistently with an object store.

The new PathOutputCommitter in Hadoop removes the simple view that output goes to a filesystem; it only promises that you can ask for a path in a filesystem to which the task data is written. While that still holds for the S3A committers, we've left the design very open so that others can plug in new committers for different filesystems (try it!) and override them with a job-specific one.

Have a look [in the hadoop source|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/PathOutputCommitterFactory.java]. It's designed to let you do something really fun: switch the commit algorithms underneath applications (generally) without them noticing. It'd have been completely transparent if I had patched FileOutputCommitter itself, but spend some time stepping through that code with a debugger and you'll understand why it's too complex and mission-critical to go near. I was scared.
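
To make the "plug in a committer per filesystem, override it per job" idea concrete, here is a minimal, dependency-free sketch of the lookup order as I understand it from `PathOutputCommitterFactory`: a job-wide factory key wins, then a per-scheme key derived from the destination URI, then the classic `FileOutputCommitterFactory`. The class `CommitterFactoryLookup`, its method `resolveFactory`, and the factory name `org.example.MyCommitterFactory` are hypothetical and exist only for illustration. A plain `Map` stands in for a Hadoop `Configuration`, and factories are represented by class-name strings rather than loaded classes.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the factory lookup order; not the Hadoop implementation. */
class CommitterFactoryLookup {
    // Configuration keys as I understand them from PathOutputCommitterFactory.
    static final String FACTORY_CLASS_KEY = "mapreduce.outputcommitter.factory.class";
    static final String FACTORY_SCHEME_PATTERN = "mapreduce.outputcommitter.factory.scheme.%s";
    static final String DEFAULT_FACTORY =
        "org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory";

    /** Returns the name of the factory class that would be chosen for this destination. */
    static String resolveFactory(Map<String, String> conf, URI dest) {
        // 1. A job-specific factory overrides everything else.
        String jobFactory = trimmed(conf.get(FACTORY_CLASS_KEY));
        if (!jobFactory.isEmpty()) {
            return jobFactory;
        }
        // 2. Otherwise look for a factory bound to the destination's URI scheme.
        if (dest != null && dest.getScheme() != null) {
            String schemeKey = String.format(FACTORY_SCHEME_PATTERN, dest.getScheme());
            String schemeFactory = trimmed(conf.get(schemeKey));
            if (!schemeFactory.isEmpty()) {
                return schemeFactory;
            }
        }
        // 3. Fall back to the classic file output committer.
        return DEFAULT_FACTORY;
    }

    private static String trimmed(String value) {
        return value == null ? "" : value.trim();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Bind every s3a:// destination to the S3A committer factory.
        conf.put("mapreduce.outputcommitter.factory.scheme.s3a",
                 "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
        System.out.println(resolveFactory(conf, URI.create("s3a://bucket/out")));
        System.out.println(resolveFactory(conf, URI.create("hdfs://nn/out")));

        // Hypothetical job-specific factory, overriding the per-scheme binding.
        conf.put(FACTORY_CLASS_KEY, "org.example.MyCommitterFactory");
        System.out.println(resolveFactory(conf, URI.create("s3a://bucket/out")));
    }
}
```

Because the application only ever asks for "the committer for this path", swapping the algorithm is a configuration change rather than a code change, which is what lets it happen without the application noticing.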
