I was trying to keep my email concise, but the rationale for setting
that to 1 by default is that it's safer. With algorithm version 2 you
run the risk of bad data when tasks fail, or even duplicate data if a
task fails and then succeeds on a reattempt (I don't know whether this
is true for all OutputCommitters that extend FileOutputCommitter).
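
To make the behaviour under discussion concrete, here is a rough sketch
in plain Python. It is not Spark or Hadoop code: the function names and
the spark_default parameter are made up for illustration, and the
per-release defaults are simply the ones described in this thread
(MAPREDUCE-6406 switching the default to 2 starting with Hadoop 3.0).

```python
from typing import Mapping, Optional

# Hadoop-side key; Spark forwards "spark.hadoop.<key>" settings to the Hadoop conf.
KEY = "mapreduce.fileoutputcommitter.algorithm.version"


def hadoop_default(hadoop_major: int) -> int:
    """Default assumed from this thread: 2 from Hadoop 3.0 on, 1 before that."""
    return 2 if hadoop_major >= 3 else 1


def effective_version(
    conf: Mapping[str, str],
    hadoop_major: int,
    spark_default: Optional[int] = None,  # hypothetical Spark-side fallback
) -> int:
    """Resolve the committer algorithm version a job would end up using."""
    if KEY in conf:
        # An explicit user setting always wins.
        return int(conf[KEY])
    if spark_default is not None:
        # Proposed behaviour: Spark supplies its own default when unset.
        return spark_default
    # Current behaviour: fall through to whatever Hadoop defaults to.
    return hadoop_default(hadoop_major)
```

With nothing set, effective_version({}, 3) falls through to the Hadoop
default, while effective_version({}, 3, spark_default=1) models the
proposal of having Spark pin v1 unless the user overrides it.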

Imran and Marcelo also discussed this here:
https://issues.apache.org/jira/browse/SPARK-20107?focusedCommentId=15945177&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945177

I also discussed this a bit with Steve Loughran, and his opinion was
that v2 should just be deprecated altogether. I believe he was going to
bring that up with the Hadoop developers.


On Thu, Jun 25, 2020 at 3:56 PM Sean Owen <sro...@gmail.com> wrote:

> I think this is a Hadoop property that is just passed through? If the
> default is different in Hadoop 3 we could mention that in the docs. I
> don't know if we want to always set it to 1 as a Spark default, even
> in Hadoop 3, right?
>
> On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem <waleed.fat...@gmail.com>
> wrote:
> >
> > Hello!
> >
> > I noticed that in the documentation starting with 2.2.0 it states that
> > the parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
> > is 1 by default:
> > https://issues.apache.org/jira/browse/SPARK-20107
> >
> > I don't actually see this being set anywhere explicitly in the Spark
> > code and so the documentation isn't entirely accurate in case you run on an
> > environment that has MAPREDUCE-6406 implemented (starting with Hadoop 3.0).
> >
> > The default version was explicitly set to 2 in the FileOutputCommitter
> > class, so any output committer that inherits from this class
> > (ParquetOutputCommitter for example) would use v2 in a Hadoop 3.0
> > environment and v1 in the older Hadoop environments.
> >
> > Would it make sense for us to consider setting v1 as the default in code
> > in case the configuration was not set by a user?
> >
> > Regards,
> >
> > Waleed
>
