Cc Yuming, Steve, and Dongjoon

On Fri, Nov 15, 2019 at 10:37 AM Cheng Lian <lian.cs....@gmail.com> wrote:

> Similar to Xiao, my major concern about making Hadoop 3.2 the default
> Hadoop version is quality control. The current hadoop-3.2 profile covers
> too many major component upgrades, i.e.:
>
>    - Hadoop 3.2
>    - Hive 2.3
>    - JDK 11
>
> We have already found and fixed some feature and performance regressions
> related to these upgrades, and, from experience, I wouldn’t be surprised
> if more regressions are lurking somewhere. On the other hand, we do want
> the community’s help to evaluate and stabilize these new changes. With
> that in mind, I’d like to propose:
>
>    1.
>
>    Introduce a new profile hive-2.3 to enable (hopefully) less risky
>    Hadoop/Hive/JDK version combinations.
>
>    This new profile allows us to decouple Hive 2.3 from the hadoop-3.2
>    profile, so that users can try out less risky Hadoop/Hive/JDK
>    combinations: if you only want Hive 2.3 and/or JDK 11, you don’t have
>    to take on the potential regressions introduced by the Hadoop 3.2
>    upgrade (see the build sketch after this list).
>
>    Yuming Wang has already sent out PR #26533
>    <https://github.com/apache/spark/pull/26533> to exercise the Hadoop 2.7
>    + Hive 2.3 + JDK 11 combination (this PR does not have the hive-2.3
>    profile yet), and the result looks promising: the Kafka streaming and
>    Arrow-related test failures appear to be unrelated to the topic
>    discussed here.
>
>    After decoupling Hive 2.3 from Hadoop 3.2, I don’t think it makes much
>    difference whether Hadoop 2.7 or Hadoop 3.2 is the default Hadoop
>    version. Users who are still running Hadoop 2.x in production will have
>    to use a hadoop-provided prebuilt package or build Spark 3.0 against
>    their own 2.x version anyway. It does make a difference for cloud users
>    who don’t use Hadoop at all, though. And making Hadoop 3.2 the default
>    probably also helps stabilize the Hadoop 3.2 code path faster, since
>    our PR builder will exercise it regularly.
>    2.
>
>    Defer Hadoop 2.x upgrade to Spark 3.1+
>
>    I personally do want to bump our Hadoop 2.x version to 2.9 or even
>    2.10. Steve has already stated the benefits very well. My worry here is
>    still quality control: Spark 3.0 already has tons of changes and major
>    component version upgrades that are subject to all kinds of known and
>    hidden regressions. Keeping Hadoop 2.7 there provides us with a safety
>    net, since it has proven to be stable. To me, it’s much less risky to
>    upgrade from Hadoop 2.7 to 2.9/2.10 after we stabilize the Hadoop
>    3.2/Hive 2.3 combination in the next one or two Spark 3.x releases.
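>
> To make the idea concrete, here is a rough sketch only (assuming the
> proposed hive-2.3 profile lands alongside the existing hadoop-2.7,
> hadoop-3.2, and hadoop-provided profiles) of how the combinations from
> point 1 could be exercised:
>
>     # Hadoop 2.7 + Hive 2.3, avoiding the Hadoop 3.2 upgrade
>     ./build/mvn -Phadoop-2.7 -Phive-2.3 -DskipTests clean package
>
>     # Hadoop 3.2 + Hive 2.3, i.e. what the hadoop-3.2 profile covers today
>     ./build/mvn -Phadoop-3.2 -Phive-2.3 -DskipTests clean package
>
>     # hadoop-provided build for users who run their own Hadoop 2.x
>     # (the 2.7.4 version number is only illustrative)
>     ./build/mvn -Phadoop-provided -Phadoop-2.7 -Dhadoop.version=2.7.4 \
>         -DskipTests clean package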
>
> Cheng
>
> On Mon, Nov 4, 2019 at 11:24 AM Koert Kuipers <ko...@tresata.com> wrote:
>
>> I get that CDH and HDP backport a lot and in that way left 2.7 behind,
>> but they kept the public APIs stable at the 2.7 level, because that's
>> kind of the point. Aren't those the Hadoop APIs Spark uses?
>>
>> On Mon, Nov 4, 2019 at 10:07 AM Steve Loughran
>> <ste...@cloudera.com.invalid> wrote:
>>
>>>
>>>
>>> On Mon, Nov 4, 2019 at 12:39 AM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Fri, Nov 1, 2019 at 8:41 AM Steve Loughran
>>>> <ste...@cloudera.com.invalid> wrote:
>>>>
>>>>> It would be really good if the spark distributions shipped with later
>>>>> versions of the hadoop artifacts.
>>>>>
>>>>
>>>> I second this. If we need to keep a Hadoop 2.x profile around, why not
>>>> make it Hadoop 2.8 or something newer?
>>>>
>>>
>>> go for 2.9
>>>
>>>>
>>>> Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> Given that the latest HDP 2.x is still on Hadoop 2.7, bumping the
>>>>> Hadoop 2 profile to the latest version would probably be an issue for us.
>>>>
>>>>
>>>> When was the last time HDP 2.x bumped their minor version of Hadoop? Do
>>>> we want to wait for them to bump to Hadoop 2.8 before we do the same?
>>>>
>>>
>>> The internal builds of CDH and HDP are not those of ASF 2.7.x. A really
>>> large proportion of the later branch-2 patches are backported; 2.7 was
>>> left behind a long time ago.
>>>
