steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-505090596

@felixcheung: For this to work you need (a) the source on your classpath and (b) the settings documented here: https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html

Hortonworks docs: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/bk_cloud-data-access/content/ch03s08s01.html

This delivers good numbers. The committer we've been enabling is the staging one, which is derived from the code that Ryan was using at Netflix. It writes to the local HDD and then uploads; there's still task commit overhead, but no files are manifest in the destination dir until job commit. Furthermore, within a single query you don't need a consistent S3 store. (You do across queries, or else a long enough gap between them for things to stabilise.)

Email me and I'll help you get set up here.
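
For reference, a rough sketch of what the settings in (b) might look like when passed to Spark for the staging committer. The property names follow the Hadoop S3A committer documentation linked above; the two Spark-side class names are assumptions based on the package used in this (still WIP) pull request and may differ in your build. The helper `to_spark_submit_args` is hypothetical and exists only to show how the options could be handed to `spark-submit`.

```python
# Sketch only: settings for the S3A "directory" staging committer.
# Hadoop options are prefixed with "spark.hadoop." so Spark forwards them
# to the Hadoop configuration.
S3A_STAGING_COMMITTER_CONF = {
    # Select the staging committer variant ("directory" or "partitioned").
    "spark.hadoop.fs.s3a.committer.name": "directory",
    # Bind the s3a:// scheme to the S3A committer factory.
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    # ASSUMPTION: class names taken from this PR's package; verify against
    # the Spark build you are actually running.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_spark_submit_args(conf):
    """Hypothetical helper: turn a dict of options into spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Print a command-line fragment that could be appended to spark-submit.
    print(" ".join(to_spark_submit_args(S3A_STAGING_COMMITTER_CONF)))
```

The same keys could instead be placed in `spark-defaults.conf`; which route is more convenient depends on how your jobs are launched.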
