[GitHub] [spark] steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

GitBox Mon, 24 Jun 2019 09:49:50 -0700

steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit 
protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-505090596
 
 
   @felixcheung: For this to work you need (a) the source on your classpath and 
(b) the settings documented here: 
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
   
   Hortonworks docs: 
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/bk_cloud-data-access/content/ch03s08s01.html
   
   It delivers good numbers. The committer we've been enabling is the staging 
one, which is derived from the code that Ryan was using at Netflix. It writes 
to the local HDD and then uploads; there's still task commit overhead, but no 
files are visible in the destination dir until job commit. Furthermore, within 
a single query, you don't need a consistent S3 store. (You do across queries, 
unless there is a long enough gap for things to stabilise.)
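   
   As a rough sketch of what those settings look like for the staging 
committer: the snippet below collects them in a dict and renders them as 
`spark-submit` flags. The option keys and class names are taken from the 
Hadoop S3A committer docs and the later Spark cloud-integration docs, so they 
are an assumption here and may not match what this WIP patch uses; the helper 
`as_submit_flags` is hypothetical and only for illustration.

```python
# Assumed settings for the S3A "directory" (staging) committer.
# Keys and class names come from the Hadoop/Spark docs, not from this PR;
# check them against the versions you actually have on your classpath.
STAGING_COMMITTER_CONF = {
    # Hadoop side: route s3a:// output through the S3A committer factory.
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    # "directory" selects the staging committer (write locally, then upload).
    "spark.hadoop.fs.s3a.committer.name": "directory",
    # Spark side: bind Spark's commit protocol to the PathOutputCommitter.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def as_submit_flags(conf):
    """Hypothetical helper: turn a conf dict into spark-submit --conf flags."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", f"{key}={conf[key]}"]
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted after `spark-submit`.
    print(" ".join(as_submit_flags(STAGING_COMMITTER_CONF)))
```

   The same key/value pairs could equally go into `spark-defaults.conf`; 
the dict form just keeps them in one place for inspection.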
   
   Email me and I'll help you get set up here.


