Cc: Yuming, Steve, and Dongjoon

On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <lian.cs....@gmail.com> wrote:
> Similar to Xiao, my major concern about making Hadoop 3.2 the default
> Hadoop version is quality control. The current hadoop-3.2 profile covers
> too many major component upgrades, i.e.:
>
>    - Hadoop 3.2
>    - Hive 2.3
>    - JDK 11
>
> We have already found and fixed some feature and performance regressions
> related to these upgrades, and empirically I wouldn't be surprised at all
> if more regressions are lurking somewhere. On the other hand, we do want
> the community's help to evaluate and stabilize these new changes.
> Following that, I'd like to propose:
>
>    1. Introduce a new profile, hive-2.3, to enable (hopefully) less risky
>    Hadoop/Hive/JDK version combinations.
>
>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>    profile, so that users may try out less risky Hadoop/Hive/JDK
>    combinations: if you only want Hive 2.3 and/or JDK 11, you don't need
>    to face potential regressions introduced by the Hadoop 3.2 upgrade.
>
>    Yuming Wang has already sent out PR #26533
>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7
>    + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
>    profile yet), and the result looks promising: the Kafka streaming and
>    Arrow related test failures should be irrelevant to the topic discussed
>    here.
>
>    After decoupling Hive 2.3 from Hadoop 3.2, I don't think it makes much
>    difference whether Hadoop 2.7 or Hadoop 3.2 is the default Hadoop
>    version. Users who are still on Hadoop 2.x in production will have to
>    use a hadoop-provided prebuilt package or build Spark 3.0 against their
>    own 2.x version anyway. It does make a difference for cloud users who
>    don't use Hadoop at all, though. And it would probably also help
>    stabilize the Hadoop 3.2 code path faster, since our PR builder would
>    then exercise it regularly.
>
>    2. Defer the Hadoop 2.x upgrade to Spark 3.1+.
>
>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>    2.10; Steve has already stated the benefits very well. My worry here is
>    still quality control: Spark 3.0 already has tons of changes and major
>    component version upgrades that are subject to all kinds of known and
>    hidden regressions. Having Hadoop 2.7 there gives us a safety net,
>    since it's proven to be stable. To me, it's much less risky to upgrade
>    Hadoop 2.7 to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive 2.3
>    combination over the next one or two Spark 3.x releases.
>
> Cheng
>
> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>
>> I get that CDH and HDP backport a lot and in that way left 2.7 behind,
>> but they kept the public APIs stable at the 2.7 level, because that's
>> kind of the point. Aren't those the Hadoop APIs Spark uses?
>>
>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>> <ste...@cloudera.com.invalid> wrote:
>>
>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>> <ste...@cloudera.com.invalid> wrote:
>>>>
>>>>> It would be really good if the Spark distributions shipped with later
>>>>> versions of the Hadoop artifacts.
>>>>
>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>> make it Hadoop 2.8 or something newer?
>>>
>>> Go for 2.9.
>>>
>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the
>>>>> Hadoop 2 profile to the latest version would probably be an issue
>>>>> for us.
>>>>
>>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>
>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>> large proportion of the later branch-2 patches are backported; 2.7 was
>>> left behind a long time ago.
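
For readers who want to try the combinations Cheng describes above, here is
a rough sketch of the corresponding build invocations. The hadoop-2.7,
hadoop-3.2, hive, hive-thriftserver, and hadoop-provided profiles already
exist in the Spark build; the hive-2.3 profile is only proposed in this
thread, so its name and usage below are assumptions, not merged behavior:

    # The default-candidate combination under discussion: Hadoop 3.2 + Hive 2.3
    ./build/mvn -Phadoop-3.2 -Phive -Phive-thriftserver -DskipTests clean package

    # With the proposed hive-2.3 profile: Hive 2.3 (and JDK 11) without
    # taking on the Hadoop 3.2 upgrade (profile name is an assumption)
    ./build/mvn -Phadoop-2.7 -Phive -Phive-2.3 -DskipTests clean package

    # Hadoop 2.x users can sidestep the bundled Hadoop entirely with a
    # "hadoop-provided" distribution built against their own Hadoop version:
    ./dev/make-distribution.sh --name hadoop-provided --tgz \
        -Phadoop-2.7 -Phadoop-provided -Phive -Dhadoop.version=2.9.2

Note that a distribution built with hadoop-provided does not bundle Hadoop
jars, so it expects the cluster's own Hadoop to be on the classpath at
runtime.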