I haven't heard any feedback so far, but I want to suggest:

use the hadoop-thirdparty repository to host any dependencies that are known
to break compatibility.

Candidate #1: Guava
Candidate #2: Netty
Candidate #3: Jetty

In fact, HBase shades these dependencies for the exact same reason.
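The mechanics would be the familiar maven-shade-plugin relocation trick that HBase already uses: the thirdparty module rewrites the dependency's package names, so the relocated copy can coexist with whatever version an application brings. A minimal sketch (the plugin wiring and the relocated package prefix here are illustrative, not the exact hadoop-thirdparty configuration):

```xml
<!-- Sketch only: relocate Guava under a Hadoop-owned package prefix so the
     shaded copy never conflicts with an application's own Guava. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <!-- illustrative prefix -->
            <shadedPattern>org.apache.hadoop.thirdparty.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Hadoop code would then import the relocated package, so upgrading the shaded Guava stops being an API-compatibility event for everyone downstream.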

As an example of the cost of a compatibility breakage: we have spent the last
six months backporting the Guava update (Guava 11 --> 27) throughout
Cloudera's stack, and we are still not done, because we have to update Guava
in Hadoop, Hive, Spark and so on, and the Guava shipped by Hadoop, Hive and
Spark is on the classpath of every application.

Thoughts?

On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <weic...@apache.org> wrote:

> Hi Hadoop devs,
>
> In the past, Hadoop has tended to stay pretty far behind the latest versions
> of its dependencies. Part of that is due to fear of the breaking changes
> brought in by dependency updates.
>
> However, things have changed dramatically over the past few years. With
> more focus on security vulnerabilities, more vulnerabilities are discovered
> in our dependencies, and users put more pressure on patching Hadoop (and
> its ecosystem) to use the latest dependency versions.
>
> As an example, Jackson-databind had 20 CVEs published in the last year
> alone.
> https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
>
> Jetty: 4 CVEs in 2019:
> https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
>
> We can no longer let Hadoop stay behind. The further behind we stay, the
> harder it is to update. A good example is the Jersey 1 -> 2 migration,
> HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>, contributed
> by Akira. Jersey 1 is no longer supported, but the Jersey 2 migration is hard.
> If a critical vulnerability is found in Jersey 1, it will leave us in a
> bad situation, since we can't simply bump the Jersey version and be done.
>
> Hadoop 3 adds new public artifacts that shade these dependencies. We
> should encourage downstream applications to use these public artifacts to
> avoid breakage.
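> To be concrete, a downstream build that wants to stay insulated would depend
> only on the shaded client artifacts (a sketch; the versions here are
> illustrative):
>
> ```xml
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client-api</artifactId>
>   <version>3.2.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client-runtime</artifactId>
>   <version>3.2.1</version>
>   <scope>runtime</scope>
> </dependency>
> ```
>
> instead of the classic hadoop-client, so that Hadoop's own Guava, Jetty,
> etc. never leak onto the application classpath.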
>
> I'd like to hear your thoughts: are you okay to see Hadoop keep up with
> the latest dependency updates, or would rather stay behind to ensure
> compatibility?
>
> Coupled with that, I'd like to call for more frequent Hadoop releases for
> the same purpose. IMHO that will require better infrastructure to assist the
> release work, and some rethinking of our current Hadoop code structure, such
> as separating each subproject into its own repository with its own release
> cadence. This can be controversial, but I think it will be good for the
> project in the long run.
>
> Thanks,
> Wei-Chiu
>
