Similar to Xiao, my major concern about making Hadoop 3.2 the default
Hadoop version is quality control. The current hadoop-3.2 profile bundles
too many major component upgrades, namely:

   - Hadoop 3.2
   - Hive 2.3
   - JDK 11

We have already found and fixed some feature and performance regressions
related to these upgrades. Empirically, I wouldn't be surprised at all if
more regressions are lurking somewhere. On the other hand, we do want the
community's help in evaluating and stabilizing these new changes. With that
in mind, I'd like to propose the following:

   1. Introduce a new profile hive-2.3 to enable (hopefully) less risky
   Hadoop/Hive/JDK version combinations.

   This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
   profile, so that users may try out some less risky Hadoop/Hive/JDK
   combinations: if you only want Hive 2.3 and/or JDK 11, you don’t need to
   face potential regressions introduced by the Hadoop 3.2 upgrade.

   Yuming Wang has already sent out PR #26533
   <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7 +
   Hive 2.3 + JDK 11 combination (that PR does not include the hive-2.3
   profile yet), and the result looks promising: the Kafka streaming and
   Arrow-related test failures appear to be unrelated to the topic discussed
   here.

   After decoupling Hive 2.3 from Hadoop 3.2, I don't think it makes much
   difference whether Hadoop 2.7 or Hadoop 3.2 is the default Hadoop version.
   Users who are still running Hadoop 2.x in production will have to use a
   hadoop-provided prebuilt package or build Spark 3.0 against their own 2.x
   version anyway (a rough sketch of such build invocations follows this
   list). It does make a difference for cloud users who don't use Hadoop at
   all, though. And defaulting to Hadoop 3.2 probably also helps stabilize
   that code path faster, since our PR builder would then exercise it
   regularly.
   2. Defer Hadoop 2.x upgrade to Spark 3.1+

   I personally do want to bump our Hadoop 2.x version to 2.9 or even 2.10.
   Steve has already stated the benefits very well. My worry here is still
   quality control: Spark 3.0 already contains tons of changes and major
   component version upgrades that are subject to all kinds of known and
   hidden regressions. Keeping Hadoop 2.7 there provides us with a safety
   net, since it's proven to be stable. To me, it's much less risky to
   upgrade from Hadoop 2.7 to 2.9/2.10 after we stabilize the Hadoop 3.2/Hive
   2.3 combination in the next 1 or 2 Spark 3.x releases.
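
To make the above a bit more concrete, here is a rough sketch of what the
build invocations could look like. The hive-2.3 profile is only a proposal
at this point, so its name here is hypothetical; the other flags
(-Phadoop-2.7, -Phadoop-provided, -Dhadoop.version, -Phive, -Pyarn) are
existing Spark build options, and the exact Hadoop version number is just
an illustrative value:

   # proposed: Hadoop 2.7 + Hive 2.3, decoupled from the hadoop-3.2 profile
   ./build/mvn -Phadoop-2.7 -Phive-2.3 -Phive -Pyarn -DskipTests clean package

   # existing options: build against a site-specific Hadoop 2.x version, or
   # leave Hadoop out of the distribution entirely and pick it up from the
   # cluster's classpath at runtime
   ./build/mvn -Phadoop-2.7 -Dhadoop.version=2.9.2 -Phive -Pyarn -DskipTests clean package
   ./build/mvn -Phadoop-provided -Phive -Pyarn -DskipTests clean package

Which JDK gets exercised (8 vs. 11) is then mostly a matter of the
JAVA_HOME the build and tests run with.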

Cheng

On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:

> i get that cdh and hdp backport a lot and in that way left 2.7 behind. but
> they kept the public apis stable at the 2.7 level, because that's kind of
> the point. aren't those the hadoop apis spark uses?
>
> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran <ste...@cloudera.com.invalid>
> wrote:
>
>>
>>
>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>> <ste...@cloudera.com.invalid> wrote:
>>>
>>>> It would be really good if the spark distributions shipped with later
>>>> versions of the hadoop artifacts.
>>>>
>>>
>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>> make it Hadoop 2.8 or something newer?
>>>
>>
>> go for 2.9
>>
>>>
>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> given that latest hdp 2.x is still hadoop 2.7, bumping the hadoop 2 profile
>>>> to latest would probably be an issue for us.
>>>
>>>
>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>
>>
>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>> large proportion of the later branch-2 patches are backported. 2.7 was left
>> behind a long time ago.
>>
>>
>>
>>
>