That is unfortunately true. Now that I recognize the impact of the guava update in Hadoop 3.1/3.2, how can we make this easier for downstream projects to consume? As I proposed, I think a middle ground is to shade guava in hadoop-thirdparty, and include the hadoop-thirdparty jar in the next Hadoop 3.1/3.2 releases.
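For concreteness, "shading" here means relocating guava's packages under a Hadoop-owned namespace at build time, so downstream classpaths never see the original `com.google.common` classes. A minimal sketch of how a hadoop-thirdparty module could do this with the maven-shade-plugin (the relocation pattern follows the `org.apache.hadoop.thirdparty` convention; treat the exact POM details as illustrative, not the real hadoop-thirdparty build):

```xml
<!-- Sketch of a shaded-guava module in the style of hadoop-thirdparty.
     Illustrative only: the real POM has more plugins and filters. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Downstream code imports
                 org.apache.hadoop.thirdparty.com.google.common.*,
                 so a later guava bump in Hadoop cannot clash with
                 whatever guava an application already ships. -->
            <pattern>com.google</pattern>
            <shadedPattern>org.apache.hadoop.thirdparty.com.google</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Consumers would then depend on the single shaded artifact instead of on guava itself, which is what decouples their guava version from ours.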
On Thu, Mar 12, 2020 at 12:03 AM Igor Dvorzhak <i...@google.com.invalid> wrote:

> How do you manage and version such dependency upgrades in subminor
> Hadoop/Spark/Hive versions in Cloudera then? I would imagine that some
> upgrades will be breaking for customers and cannot be shipped in a subminor
> CDH release. Or is this in preparation for the next major/minor release of
> CDH?
>
> On Wed, Mar 11, 2020 at 5:45 PM Wei-Chiu Chuang
> <weic...@cloudera.com.invalid> wrote:
>
>> FWIW we are updating guava in Spark and Hive at Cloudera. I don't know
>> which Apache versions they are going to land in, but we'll upstream them
>> for sure.
>>
>> The guava change is debatable; it's not as critical as others. There are
>> critical vulnerabilities in other dependencies that we have no choice but
>> to update to a new major/minor version, because we are so far behind. And
>> given the critical nature, I think it is worth the risk, and backporting
>> to lower maintenance releases is warranted. Moreover, our minor releases
>> are at best one per year. That is too slow to respond to a critical
>> vulnerability.
>>
>> On Wed, Mar 11, 2020 at 5:02 PM Igor Dvorzhak <i...@google.com.invalid>
>> wrote:
>>
>> > Generally I'm for updating dependencies, but I think that Hadoop should
>> > stick with semantic versioning and not make major and minor dependency
>> > updates in subminor releases.
>> >
>> > For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of
>> > this Spark 3.0 is stuck with Hadoop 3.2.0 - they use Hive 2.3.6, which
>> > doesn't support Guava 27.0-jre.
>> >
>> > It would be better to make dependency upgrades when releasing new
>> > major/minor versions; for example, the Guava 27.0-jre upgrade was more
>> > appropriate for the Hadoop 3.3.0 release than for 3.2.1.
>> >
>> > On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
>> > <weic...@cloudera.com.invalid> wrote:
>> >
>> >> I'm not hearing any feedback so far, but I want to suggest:
>> >>
>> >> use the hadoop-thirdparty repository to host any dependencies that are
>> >> known to break compatibility.
>> >>
>> >> Candidate #1 guava
>> >> Candidate #2 Netty
>> >> Candidate #3 Jetty
>> >>
>> >> In fact, HBase shades these dependencies for the exact same reason.
>> >>
>> >> As an example of the cost of compatibility breakage: we spent the last
>> >> 6 months backporting the guava update (guava 11 --> 27) throughout
>> >> Cloudera's stack, and after 6 months we are still not done, because we
>> >> have to update guava in Hadoop, Hive, Spark ..., and Hadoop's, Hive's,
>> >> and Spark's guava is in the classpath of every application.
>> >>
>> >> Thoughts?
>> >>
>> >> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <weic...@apache.org>
>> >> wrote:
>> >>
>> >> > Hi Hadoop devs,
>> >> >
>> >> > In the past, Hadoop has tended to be pretty far behind the latest
>> >> > versions of its dependencies. Part of that is due to the fear of the
>> >> > breaking changes brought in by dependency updates.
>> >> >
>> >> > However, things have changed dramatically over the past few years.
>> >> > With more focus on security, more vulnerabilities are discovered in
>> >> > our dependencies, and users put more pressure on patching Hadoop
>> >> > (and its ecosystem) to use the latest dependency versions.
>> >> >
>> >> > As an example, Jackson-databind had 20 CVEs published in the last
>> >> > year alone:
>> >> > https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
>> >> >
>> >> > Jetty: 4 CVEs in 2019:
>> >> > https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
>> >> >
>> >> > We can no longer let Hadoop stay behind.
>> >> > The more we stay behind, the harder it is to update. A good example
>> >> > is the Jersey 1 -> 2 migration,
>> >> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>,
>> >> > contributed by Akira. Jersey 1 is no longer supported, but the
>> >> > Jersey 2 migration is hard. If any critical vulnerability is found
>> >> > in Jersey 1, it will leave us in a bad situation, since we can't
>> >> > simply update the Jersey version and be done.
>> >> >
>> >> > Hadoop 3 adds new public artifacts that shade these dependencies. We
>> >> > should encourage downstream applications to use the public artifacts
>> >> > to avoid breakage.
>> >> >
>> >> > I'd like to hear your thoughts: are you okay with seeing Hadoop keep
>> >> > up with the latest dependency updates, or would you rather stay
>> >> > behind to ensure compatibility?
>> >> >
>> >> > Coupled with that, I'd like to call for more frequent Hadoop
>> >> > releases for the same purpose. IMHO that'll require better
>> >> > infrastructure to assist the release work, and some rethinking of
>> >> > our current Hadoop code structure, like separating each subproject
>> >> > into its own repository with its own release cadence. This can be
>> >> > controversial, but I think it'll be good for the project in the long
>> >> > run.
>> >> >
>> >> > Thanks,
>> >> > Wei-Chiu
>> >> >
>> >>
>> >
>> 