[GitHub] [spark] tgravescs commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

GitBox Tue, 29 Sep 2020 10:31:40 -0700


tgravescs commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700866136



   I'm fine with changing the default. I was trying to figure out the cases 
where a user would actually see this.  
   
   The MapReduce paradigm and Spark both rely on the output of tasks being 
deterministic. If it is not, there are other issues with retries and the 
output has no guarantees.  I thought Spark had deterministic output path 
naming, but I was just starting to check that I was remembering that properly. 
   
   If those are true, I think that just leaves the _SUCCESS file issue, which 
I can see being a problem if people don't check for it.
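
   To make the _SUCCESS point concrete, here is a rough sketch of the kind of 
check a downstream reader would need. It assumes the output directory is on a 
local filesystem and that the job writes the usual `_SUCCESS` marker at job 
commit; the helper names `output_is_committed` and `list_part_files` are made 
up for illustration and are not Spark or Hadoop APIs. On HDFS or an object 
store the same idea would go through the relevant filesystem client instead.

```python
from pathlib import Path

# Marker file written at job commit (assumed to be enabled for the job).
SUCCESS_MARKER = "_SUCCESS"


def output_is_committed(output_dir):
    """Return True only if the job-level _SUCCESS marker exists in output_dir."""
    return (Path(output_dir) / SUCCESS_MARKER).is_file()


def list_part_files(output_dir):
    """List part files, refusing to read output whose job has not committed.

    Hypothetical helper: without this check, a reader may pick up part files
    that tasks have already moved into place before the job as a whole finished.
    """
    if not output_is_committed(output_dir):
        raise RuntimeError(f"{output_dir} has no {SUCCESS_MARKER}; job may be incomplete")
    return sorted(
        p.name for p in Path(output_dir).iterdir() if p.name.startswith("part-")
    )
```

   A reader that skips this check and just lists `part-*` files is the case 
where partially written output could be consumed.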
   
   Are there cases I'm missing here?  Are there cases where cloud providers or 
other tools change the output paths or something similar? @steveloughran, did 
you see this in a particular situation?


