To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
spark.read.parquet('s3a://...')
```

I agree that Hadoop 3 would be a better default (again, the s3a support is just much better). But to Xiao's point, if you are expecting Spark to work with some package like hadoop-aws that assumes an older version of Hadoop bundled with Spark, then changing the default may break your workflow.

In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to hadoop-aws:3.2.1 (see the sketch at the end of this message). But perhaps there are other PyPI-based workflows that would be more difficult to repair. 🤷‍♂️

On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
> I'm also genuinely curious when PyPI users would care about the
> bundled Hadoop jars - do we even need two versions? that itself is
> extra complexity for end users.
> I do think Hadoop 3 is the better choice for the user who doesn't
> care, and better long term.
> OK but let's at least move ahead with changing defaults.
>
> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
> >
> > Hi, Dongjoon,
> >
> > Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >
> > Also, let me repeat my opinion: the top priority is to provide two options for the PyPI distribution and let the end users choose the one they need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let us follow our protocol documented in https://spark.apache.org/versioning-policy.html.
> >
> > If you just want to change the Jenkins setup, I am OK with it. If you want to change the default distribution, we need more discussion in the community to reach an agreement.
> >
> > Thanks,
> >
> > Xiao
> >
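
P.S. For completeness, a rough sketch of the adjusted workflow under a Hadoop 3 default. The only assumed change is bumping hadoop-aws to 3.2.1 so it matches the bundled Hadoop 3.2 jars, as suggested above; everything else stays the same:

```
pip install pyspark
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```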