Hey Steve,

In terms of Maven artifacts, I don't think the default Hadoop version
matters except for the spark-hadoop-cloud module, which is only meaningful
under the hadoop-3.2 profile. All the other spark-* artifacts published to
Maven Central are Hadoop-version-neutral.

Another issue with switching the default Hadoop version to 3.2 is the
PySpark distribution. Right now, we only publish PySpark artifacts prebuilt
with Hadoop 2.x to PyPI. I'm not sure whether bumping the Hadoop dependency
to 3.2 is feasible for PySpark users. Or maybe we should publish PySpark
prebuilt with both Hadoop 2.x and 3.x. I'm open to suggestions on this one.
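
For context, here is a quick, unofficial sketch of how a PyPI user can check
which Hadoop version their pip-installed PySpark actually bundles. The jars/
layout of the wheel and the _jvm gateway are internal details, so treat this
as illustrative only:

    # Rough sketch: inspect a pip-installed PySpark to see which Hadoop it bundles.
    import os
    import pyspark

    # The PyPI wheel ships its bundled jars under pyspark/jars/.
    jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
    print(sorted(j for j in os.listdir(jars_dir) if j.startswith("hadoop-")))

    # Or ask the running JVM directly; note that `_jvm` is an internal API.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
    spark.stop()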

Again, as long as the Hive 2.3 and Hadoop 3.2 upgrades can be decoupled via
the proposed hive-2.3 profile, I personally don't have a preference between
Hadoop 2.7 and 3.2 as the default Hadoop version. But just to minimize the
release management work, in case we decide to publish the other spark-*
Maven artifacts from a Hadoop 2.7 build, we can still special-case
spark-hadoop-cloud and publish it from a hadoop-3.2 build.

On Mon, Nov 18, 2019 at 8:39 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> I also agree with Steve and Felix.
>
> Let's have another thread to discuss the Hive issue,
>
> because this thread was originally about the `hadoop` version.
>
> And now we can have a `hive-2.3` profile for both the `hadoop-2.7` and
> `hadoop-3.0` versions.
>
> We don't need to mix both.
>
> Bests,
> Dongjoon.
>
>
> On Mon, Nov 18, 2019 at 8:19 PM Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> 1000% with Steve, the org.spark-project hive 1.2 will need a solution. It
>> is old and rather buggy, and it's been *years*.
>>
>> I think we should decouple the Hive change from everything else, if people
>> are concerned?
>>
>> ------------------------------
>> *From:* Steve Loughran <ste...@cloudera.com.INVALID>
>> *Sent:* Sunday, November 17, 2019 9:22:09 AM
>> *To:* Cheng Lian <lian.cs....@gmail.com>
>> *Cc:* Sean Owen <sro...@gmail.com>; Wenchen Fan <cloud0...@gmail.com>;
>> Dongjoon Hyun <dongjoon.h...@gmail.com>; dev <dev@spark.apache.org>;
>> Yuming Wang <wgy...@gmail.com>
>> *Subject:* Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?
>>
>> Can I take this moment to remind everyone that the version of Hive which
>> Spark has historically bundled (the org.spark-project one) is an orphan
>> project put together to deal with Hive's shading issues, and a source of
>> unhappiness in the Hive project. Whatever gets shipped should do its best
>> to avoid including that artifact.
>>
>> Postponing a switch to hadoop 3.x until after spark 3.0 is probably the
>> safest move from a risk minimisation perspective. If something has broken,
>> then at least you can start with the assumption that it is in the o.a.s
>> packages without having to debug o.a.hadoop and o.a.hive first. There is a
>> cost: if there are problems with the hadoop / hive dependencies, those
>> teams will inevitably ignore filed bug reports, for the same reason the
>> spark team will probably close 1.6-related JIRAs as WONTFIX. WONTFIX
>> responses for the Hadoop 2.x line include any compatibility issues with
>> Java 9+. Do bear that in mind: it's not been tested, it has dependencies
>> on artifacts we know are incompatible, and as far as the Hadoop project is
>> concerned, people should move to branch 3 if they want to run on a modern
>> version of Java.
>>
>> It would be really, really good if the published spark maven artefacts (a)
>> included the spark-hadoop-cloud JAR and (b) were dependent upon hadoop 3.x.
>> That way, people doing things with their own projects will get up-to-date
>> dependencies and won't get WONTFIX responses themselves.
>>
>> -Steve
>>
>> PS: There is discussion on hadoop-dev about making Hadoop 2.10 the
>> official "last ever" branch-2 release and then declaring its predecessors
>> EOL; 2.10 will be the transition release.
>>
>> On Sun, Nov 17, 2019 at 1:50 AM Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>> Dongjoon, I didn't follow the original Hive 2.3 discussion closely. I
>> thought the original proposal was to replace Hive 1.2 with Hive 2.3, which
>> seemed risky, and therefore we only introduced Hive 2.3 under the
>> hadoop-3.2 profile without removing Hive 1.2. But maybe I'm totally wrong
>> here...
>>
>> Sean, Yuming's PR https://github.com/apache/spark/pull/26533 showed that
>> Hadoop 2 + Hive 2 + JDK 11 looks promising. My major motivation is not
>> about demand, but risk control: coupling the Hive 2.3, Hadoop 3.2, and JDK
>> 11 upgrades together looks too risky.
>>
>> On Sat, Nov 16, 2019 at 4:03 AM Sean Owen <sro...@gmail.com> wrote:
>>
>> I'd prefer simply not making Hadoop 3 the default until Spark 3.1+, rather
>> than introducing yet another build combination. Does Hadoop 2 + Hive 2
>> work, and is there demand for it?
>>
>> On Sat, Nov 16, 2019 at 3:52 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>> >
>> > Do we have a limitation on the number of pre-built distributions? It
>> > seems this time we need:
>> > 1. hadoop 2.7 + hive 1.2
>> > 2. hadoop 2.7 + hive 2.3
>> > 3. hadoop 3 + hive 2.3
>> >
>> > AFAIK we have always built with JDK 8 (but make it JDK 11 compatible),
>> > so we don't need to add the JDK version to the combination.
>> >
>> > On Sat, Nov 16, 2019 at 4:05 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>> >>
>> >> Thank you for the suggestion.
>> >>
>> >> Having a `hive-2.3` profile sounds good to me because it's orthogonal
>> >> to Hadoop 3.
>> >> IIRC, it was originally proposed that way, but we put it under
>> >> `hadoop-3.2` to avoid adding new profiles at the time.
>> >>
>> >> And I'm wondering if you are considering additional pre-built
>> >> distributions and Jenkins jobs.
>> >>
>> >> Bests,
>> >> Dongjoon.
>> >>
>>
>>
