steveloughran commented on issue #24970: [SPARK-23977][SQL] Support High Performance S3A committers [test-hadoop3.2] URL: https://github.com/apache/spark/pull/24970#issuecomment-506810544

I'll cut the feature (and test) and we can follow up with discussions about what to do.

Regarding automatic binding: the MRv2 APIs in Hadoop will bind automatically, so, ignoring the special case of Parquet, any configured factory for a filesystem is instantiated and used; it's up to individual factories to decide what to do. (That is: any subclass of `FileOutputFormat` which doesn't override the `getOutputCommitter()` call will go through the factories.)

The key extra work in this code is to make sure that `getWorkPath()` is passed to the implementation: currently `HadoopMapReduceCommitProtocol.newTaskTempFile` casts its created committer to `FileOutputCommitter` in order to invoke it. A big part of the new `PathOutputCommitter` interface is simply to pull that method up into an interface, so that it's possible to implement committers which export this feature. I am happy to add that interface to Hadoop 2.x and the shipping 3.x versions, so that even the base class could cast to `PathOutputCommitter` to use this. Otherwise, I could switch it to use reflection.

Key point: ignoring Parquet quirks, the `PathOutputCommitProtocol` binding could be removed entirely.

(Oh, and I'm ignoring uses of the MRv1 APIs. Happy to tag those as deprecated in hadoop-mapreduce.jar and help Spark code migrate off them. Their time is over.)
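To illustrate the reflection alternative mentioned above: a minimal, self-contained sketch of how a commit protocol could probe a committer for a `getWorkPath()` method instead of hard-casting to `FileOutputCommitter`. The `StubCommitter` class and `workPathOf` helper here are hypothetical stand-ins for illustration only, not the real Hadoop or Spark types.

```java
import java.lang.reflect.Method;

public class WorkPathLookup {
    // Hypothetical stand-in for a committer that exposes a work path,
    // the way FileOutputCommitter.getWorkPath() does in Hadoop.
    static class StubCommitter {
        public String getWorkPath() { return "s3a://bucket/__magic/job-1"; }
    }

    // Reflectively invoke getWorkPath() if the committer declares it;
    // return null when the method is absent so callers can fall back.
    static String workPathOf(Object committer) {
        try {
            Method m = committer.getClass().getMethod("getWorkPath");
            Object path = m.invoke(committer);
            return path == null ? null : path.toString();
        } catch (NoSuchMethodException e) {
            return null; // this committer has no work-path concept
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("getWorkPath() invocation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(workPathOf(new StubCommitter())); // has the method
        System.out.println(workPathOf(new Object()));        // does not
    }
}
```

Pulling the method up into a shared `PathOutputCommitter` type, as proposed, avoids this lookup cost and fragility entirely: the protocol can then use an ordinary `instanceof` check and cast.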
