Thank you for the suggestion. Having a `hive-2.3` profile sounds good to me because it's orthogonal to Hadoop 3. IIRC, it was originally proposed that way, but we put it under `hadoop-3.2` at the time to avoid adding new profiles.
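For concreteness, a rough sketch of the combinations a decoupled profile would allow. The `hive-2.3` profile name follows Cheng's proposal below and is not merged yet, so its exact wiring is an assumption; the other profiles and flags are the existing ones:

    # Hadoop 2.7 (current default) + Hive 2.3, skipping the Hadoop 3.2 upgrade;
    # assumes the proposed hive-2.3 profile selects the Hive 2.3 dependency
    ./build/mvn -Phive -Phive-thriftserver -Phive-2.3 -DskipTests clean package

    # Hadoop 3.2 + Hive 2.3: roughly what -Phadoop-3.2 already implies today
    ./build/mvn -Phadoop-3.2 -Phive -Phive-thriftserver -Phive-2.3 -DskipTests clean package

    # A "Hadoop-free" pre-built package for users who bring their own Hadoop 2.x
    ./dev/make-distribution.sh --name hadoop-provided --tgz -Phadoop-provided -Phive -Phive-2.3

Each extra supported combination is another pre-built binary and another Jenkins matrix entry, hence my question below.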
And I'm wondering if you are considering additional pre-built distributions and Jenkins jobs.

Bests,
Dongjoon.

On Fri, Nov 15, 2019 at 1:38 PM Cheng Lian <lian.cs....@gmail.com> wrote:

> Cc Yuming, Steve, and Dongjoon
>
> On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> Similar to Xiao, my major concern about making Hadoop 3.2 the default
>> Hadoop version is quality control. The current hadoop-3.2 profile covers
>> too many major component upgrades, i.e.:
>>
>> - Hadoop 3.2
>> - Hive 2.3
>> - JDK 11
>>
>> We have already found and fixed some feature and performance regressions
>> related to these upgrades, and empirically I wouldn't be surprised at
>> all if more regressions are lurking somewhere. On the other hand, we do
>> want the community's help in evaluating and stabilizing these new
>> changes. To that end, I'd like to propose:
>>
>> 1. Introduce a new profile hive-2.3 to enable (hopefully) less risky
>> Hadoop/Hive/JDK version combinations.
>>
>> This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>> profile, so that users may try out less risky Hadoop/Hive/JDK
>> combinations: if you only want Hive 2.3 and/or JDK 11, you don't need to
>> face potential regressions introduced by the Hadoop 3.2 upgrade.
>>
>> Yuming Wang has already sent out PR #26533
>> <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7
>> + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
>> profile yet), and the result looks promising: the Kafka streaming and
>> Arrow related test failures should be irrelevant to the topic discussed
>> here.
>>
>> After decoupling Hive 2.3 and Hadoop 3.2, I don't think it makes much
>> difference whether Hadoop 2.7 or Hadoop 3.2 is the default Hadoop
>> version. Users who are still on Hadoop 2.x in production will have to
>> use a hadoop-provided prebuilt package or build Spark 3.0 against their
>> own 2.x version anyway (see the version-override sketch at the end of
>> the thread). It does make a difference for cloud users who don't use
>> Hadoop at all, though. And it probably also helps to stabilize the
>> Hadoop 3.2 code path faster, since our PR builder will exercise it
>> regularly.
>>
>> 2. Defer the Hadoop 2.x upgrade to Spark 3.1+.
>>
>> I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10.
>> Steve has already stated the benefits very well. My worry here is still
>> quality control: Spark 3.0 already has tons of changes and major
>> component version upgrades that are subject to all kinds of known and
>> hidden regressions. Having Hadoop 2.7 there provides a safety net, since
>> it's proven to be stable. To me, it's much less risky to upgrade Hadoop
>> 2.7 to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3 combination
>> in the next one or two Spark 3.x releases.
>>
>> Cheng
>>
>> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> I get that CDH and HDP backport a lot, and in that way left 2.7 behind.
>>> But they kept the public APIs stable at the 2.7 level, because that's
>>> kind of the point. Aren't those the Hadoop APIs Spark uses?
>>>
>>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>>> <ste...@cloudera.com.invalid> wrote:
>>>
>>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>>> <ste...@cloudera.com.invalid> wrote:
>>>>>
>>>>>> It would be really good if the Spark distributions shipped with
>>>>>> later versions of the Hadoop artifacts.
>>>>>
>>>>> I second this. If we need to keep a Hadoop 2.x profile around, why
>>>>> not make it Hadoop 2.8 or something newer?
>>>>
>>>> Go for 2.9.
>>>>
>>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the
>>>>>> Hadoop 2 profile to the latest would probably be an issue for us.
>>>>>
>>>>> When was the last time HDP 2.x bumped their minor version of Hadoop?
>>>>> Do we want to wait for them to bump to Hadoop 2.8 before we do the
>>>>> same?
>>>>
>>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A
>>>> really large proportion of the later branch-2 patches are backported.
>>>> 2.7 was left behind a long time ago.
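A minimal sketch of the version-override build mentioned above, using the existing hadoop.version Maven property (the 2.9.2 value is illustrative only):

    # Testing against a newer Hadoop 2.x line needs no new profile; the
    # hadoop.version property overrides the default 2.7.x dependency set
    ./build/mvn -Phadoop-2.7 -Dhadoop.version=2.9.2 -DskipTests clean package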