To rephrase my earlier email, PyPI users would care about the bundled
Hadoop version if they have a workflow that, in effect, looks something
like this:

```
# in a shell
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7

# then, inside the pyspark shell
spark.read.parquet('s3a://...')
```

I agree that Hadoop 3 would be a better default (again, the s3a support is
just much better). But to Xiao's point, if your workflow pins a package like
hadoop-aws to the Hadoop version currently bundled with Spark, then changing
the default may break it.
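
As an aside, if you want to see which Hadoop version a pip-installed
PySpark actually bundles, something like the following should work. It is
a rough sketch and goes through the internal `_jvm` handle, so treat it as
a debugging trick rather than a stable API:

```
from pyspark.sql import SparkSession

# Start a throwaway local session just to reach the JVM
spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hadoop's VersionInfo reports the version of the Hadoop jars on Spark's classpath
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

spark.stop()
```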

In the case of hadoop-aws the fix is simple: just bump hadoop-aws:2.7.7 to
hadoop-aws:3.2.1 so it matches the bundled Hadoop 3 jars (sketched below).
But perhaps there are other PyPI-based workflows that would be more
difficult to repair. 🤷‍♂️
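
For what it's worth, assuming the only change needed is the hadoop-aws
coordinate, the workflow above would become:

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```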

On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:

> I'm also genuinely curious when PyPI users would care about the
> bundled Hadoop jars - do we even need two versions? That itself is
> extra complexity for end users.
> I do think Hadoop 3 is the better choice for the user who doesn't
> care, and better long term.
> OK but let's at least move ahead with changing defaults.
>
> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
> >
> > Hi, Dongjoon,
> >
> > Please do not misinterpret my point. I already clearly said "I do not
> know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >
> > Also, let me repeat my opinion: the top priority is to provide two
> options for the PyPI distribution and let the end users choose the one they
> need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
> breaking change, let us follow our protocol documented in
> https://spark.apache.org/versioning-policy.html.
> >
> > If you just want to change the Jenkins setup, I am OK with it. If you
> want to change the default distribution, we need more discussion in the
> community to reach an agreement.
> >
> >  Thanks,
> >
> > Xiao
> >
>
