waleedfateem opened a new pull request #29541:
URL: https://github.com/apache/spark/pull/29541


   The current documentation states that the default value of 
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1. This is not 
accurate: Spark does not set this configuration anywhere, so the effective value 
is inherited from the Hadoop FileOutputCommitter class.
   
   ### What changes were proposed in this pull request?
   
   This change clarifies that the default value depends entirely on the Hadoop 
version of the runtime environment.
   
   ### Why are the changes needed?
   
   The same application, with no changes, uses algorithm version 1 in some 
environments and version 2 in environments running Hadoop 3.0 and later. This 
can have serious consequences in certain scenarios; for example, two task 
attempts can partially overwrite each other's output if speculation is enabled. 
See also the following JIRA:
   https://issues.apache.org/jira/browse/MAPREDUCE-7282
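
Because the effective default follows the Hadoop version, an application that needs consistent behavior across environments can set the value explicitly. The following is a minimal PySpark sketch of that idea, not part of this patch; the helper names `pinned_committer_conf` and `build_session` are hypothetical and exist only for illustration.

```python
# Spark copies every "spark.hadoop.*" property into the Hadoop Configuration
# with the "spark.hadoop." prefix removed.
COMMITTER_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def pinned_committer_conf(version=1):
    """Return Spark conf entries that pin the committer algorithm version.

    Hypothetical helper for illustration; only versions 1 and 2 are valid.
    """
    if version not in (1, 2):
        raise ValueError("algorithm version must be 1 or 2")
    return {COMMITTER_KEY: str(version)}


def build_session(app_name, version=1):
    """Create a SparkSession with the committer algorithm version pinned."""
    # Imported lazily so pinned_committer_conf works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in pinned_committer_conf(version).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same property can also be passed on the command line with `--conf`, which keeps the choice out of application code.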
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The configuration page previously stated that the default 
FileOutputCommitter algorithm version was v1. It now reads "Dependent on 
environment", with additional information in the description column to 
elaborate.
   
   
   ### How was this patch tested?
   
   Checked the rendered changes locally in a browser.
   

