On Tue, Nov 19, 2019 at 10:40 PM Cheng Lian <lian.cs....@gmail.com> wrote:
> Hey Steve,
>
> In terms of Maven artifact, I don't think the default Hadoop version
> matters except for the spark-hadoop-cloud module, which is only meaningful
> under the hadoop-3.2 profile. All the other spark-* artifacts published to
> Maven central are Hadoop-version-neutral.
>

It's more that everyone using it has to play the game of excluding all the old artifacts and requesting the new dependencies, including working out what the Spark poms excluded from their imports of later versions of things (sketched below).

>
> Another issue about switching the default Hadoop version to 3.2 is PySpark
> distribution. Right now, we only publish PySpark artifacts prebuilt with
> Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency to
> 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
> prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
>
> Again, as long as Hive 2.3 and Hadoop 3.2 upgrade can be decoupled via the
> proposed hive-2.3 profile, I personally don't have a preference over having
> Hadoop 2.7 or 3.2 as the default Hadoop version. But just for minimizing
> the release management work, in case we decided to publish other spark-*
> Maven artifacts from a Hadoop 2.7 build, we can still special case
> spark-hadoop-cloud and publish it using a hadoop-3.2 build.
>

That would really complicate life on Maven. Sticking a version on mvn central with the 3.2 dependencies consistently would be better.
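
For anyone who hasn't had to play that game yet, here is a rough sbt sketch of what the downstream exclusion dance looks like; the artifact choices and version numbers are illustrative assumptions, not a tested combination:

  // build.sbt fragment: swap Spark's default Hadoop 2.x client jars for Hadoop 3.x ones.
  // Versions below are assumptions for illustration only, not a recommended matrix.
  libraryDependencies ++= Seq(
    // pull in Spark, but drop every transitive org.apache.hadoop artifact it declares
    ("org.apache.spark" %% "spark-sql" % "3.0.0")
      .excludeAll(ExclusionRule(organization = "org.apache.hadoop")),
    // then explicitly request the Hadoop 3.x client
    "org.apache.hadoop" % "hadoop-client" % "3.2.1",
    // cloud connector jars have to match the client version
    "org.apache.hadoop" % "hadoop-aws" % "3.2.1"
  )

The part the snippet doesn't show is re-adding whatever the Spark poms themselves excluded from their Hadoop imports; that's the bit people usually only discover by trial and error.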