steveloughran commented on issue #24970: [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2] URL: https://github.com/apache/spark/pull/24970#issuecomment-506810544

I'll cut the feature (and test) and we can follow up with discussions about what to do.

Regarding automatic binding: the MRv2 APIs in Hadoop will bind automatically, so, ignoring the special case of Parquet, any configured factory for a filesystem is instantiated and used; it's up to individual factories to decide what to do. (That is: any subclass of `FileOutputFormat` which doesn't override the `getOutputCommitter()` call will go through the factories.)

The key extra work in this code is to make sure that `getWorkPath()` is passed to the implementation: currently `HadoopMapReduceCommitProtocol.newTaskTempFile` casts its created committer to `FileOutputCommitter` in order to invoke it. A big part of the new `PathOutputCommitter` interface is simply to pull that method up into an interface, so that it's possible to implement committers which export this feature. I am happy to add that interface to Hadoop 2.x and the shipping 3.x versions, so that even the base class could cast to `PathOutputCommitter` to use this. Otherwise, I could switch it to use reflection.

Key point: ignoring Parquet quirks, the `PathOutputCommitProtocol` binding could be removed entirely.

(Oh, and I'm ignoring uses of the MRv1 APIs. Happy to tag those as deprecated in hadoop-mapreduce.jar and help Spark code migrate off them. Their time is over.)
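To illustrate the reflection alternative mentioned above: a minimal, self-contained sketch of how a commit protocol could probe a committer for a `getWorkPath()` method instead of hard-casting to `FileOutputCommitter`. The `StubCommitter` class and `workPathOf` helper here are hypothetical stand-ins for illustration only, not the real Hadoop or Spark types.

```java
import java.lang.reflect.Method;

public class WorkPathLookup {
    // Hypothetical stand-in for a committer that exposes a work path,
    // the way FileOutputCommitter.getWorkPath() does in Hadoop.
    static class StubCommitter {
        public String getWorkPath() { return "s3a://bucket/__magic/job-1"; }
    }

    // Reflectively invoke getWorkPath() if the committer declares it;
    // return null when the method is absent so callers can fall back.
    static String workPathOf(Object committer) {
        try {
            Method m = committer.getClass().getMethod("getWorkPath");
            Object path = m.invoke(committer);
            return path == null ? null : path.toString();
        } catch (NoSuchMethodException e) {
            return null; // this committer has no work-path concept
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("getWorkPath() invocation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(workPathOf(new StubCommitter())); // has the method
        System.out.println(workPathOf(new Object()));        // does not
    }
}
```

Pulling the method up into a shared `PathOutputCommitter` type, as proposed, avoids this lookup cost and fragility entirely: the protocol can then use an ordinary `instanceof` check and cast.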
