Re: [DISCUSS] Accelerate Hadoop dependency updates

Igor Dvorzhak Wed, 11 Mar 2020 17:03:05 -0700

Generally I'm for updating dependencies, but I think that Hadoop should
stick with semantic versioning and not make major or minor dependency
updates in subminor (patch) releases.


For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of this
Spark 3.0 is stuck with Hadoop 3.2.0: it uses Hive 2.3.6, which doesn't
support Guava 27.0-jre.

It would be better to make dependency upgrades when releasing new
major/minor versions. For example, the Guava 27.0-jre upgrade would have
been more appropriate for the Hadoop 3.3.0 release than for 3.2.1.
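
To make the rule concrete, here is a minimal sketch of how such a policy
could be checked. It is only an illustration, not anything that exists in
Hadoop: the class name DependencyBumpPolicy and its methods are hypothetical,
and the previous Guava version is written as "11.0" based on the
"guava 11 --> 27" mentioned elsewhere in this thread.

```java
public class DependencyBumpPolicy {
    // Size of a version change, ordered from largest to smallest.
    public enum Level { MAJOR, MINOR, PATCH, NONE }

    // Parse the numeric "major.minor.patch" prefix of a version string.
    // Qualifiers such as "-jre" are ignored; missing components default to 0.
    public static int[] parse(String version) {
        String core = version.split("-", 2)[0];
        String[] parts = core.split("\\.");
        int[] out = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    // Classify the change between two versions by the first component that differs.
    public static Level change(String from, String to) {
        int[] a = parse(from);
        int[] b = parse(to);
        if (a[0] != b[0]) return Level.MAJOR;
        if (a[1] != b[1]) return Level.MINOR;
        if (a[2] != b[2]) return Level.PATCH;
        return Level.NONE;
    }

    // Hypothetical policy: a patch release may only carry patch-level
    // dependency updates; major and minor releases may carry any update.
    public static boolean allowed(String releaseFrom, String releaseTo,
                                  String depFrom, String depTo) {
        Level release = change(releaseFrom, releaseTo);
        Level dep = change(depFrom, depTo);
        if (release == Level.MAJOR || release == Level.MINOR) {
            return true;
        }
        return dep == Level.PATCH || dep == Level.NONE;
    }

    public static void main(String[] args) {
        // Guava 11 -> 27 in a patch release (3.2.0 -> 3.2.1): rejected.
        System.out.println(allowed("3.2.0", "3.2.1", "11.0", "27.0-jre")); // false
        // The same upgrade in a minor release (3.2.1 -> 3.3.0): accepted.
        System.out.println(allowed("3.2.1", "3.3.0", "11.0", "27.0-jre")); // true
    }
}
```

A check along these lines could run at release time against the list of
changed dependency versions, but where it would be wired in is left open.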

On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
<weic...@cloudera.com.invalid> wrote:

> I'm not hearing any feedback so far, but I want to suggest:
>
> Use the hadoop-thirdparty repository to host any dependencies that are
> known to break compatibility.
>
> Candidate #1 guava
> Candidate #2 Netty
> Candidate #3 Jetty
>
> In fact, HBase shades these dependencies for the exact same reason.
>
> As an example of the cost of compatibility breakage: we spent the last 6
> months backporting the guava update change (guava 11 --> 27) throughout
> Cloudera's stack, and after 6 months we are not done yet, because we have
> to update guava in Hadoop, Hive, Spark ..., and Hadoop's, Hive's and
> Spark's guava is on the classpath of every application.
>
> Thoughts?
>
> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <weic...@apache.org> wrote:
>
> > Hi Hadoop devs,
> >
> > In the past, Hadoop has tended to be pretty far behind the latest
> > versions of its dependencies. Part of that is due to fear of the
> > breaking changes brought in by dependency updates.
> >
> > However, things have changed dramatically over the past few years. With
> > more focus on security vulnerabilities, more vulnerabilities are
> > discovered in our dependencies, and users put more pressure on patching
> > Hadoop (and its ecosystem) to use the latest dependency versions.
> >
> > As an example, Jackson-databind had 20 CVEs published in the last year
> > alone.
> >
> > https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
> >
> > Jetty: 4 CVEs in 2019:
> >
> > https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
> >
> > We can no longer let Hadoop stay behind. The longer we stay behind, the
> > harder it is to update. A good example is the Jersey 1 -> 2 migration,
> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>,
> > contributed by Akira. Jersey 1 is no longer supported, but the Jersey 2
> > migration is hard. If any critical vulnerability is found in Jersey 1, it
> > will leave us in a bad situation, since we can't simply update the Jersey
> > version and be done.
> >
> > Hadoop 3 adds new public artifacts that shade these dependencies. We
> > should encourage downstream applications to use the public artifacts to
> > avoid breakage.
> >
> > I'd like to hear your thoughts: are you okay with Hadoop keeping up with
> > the latest dependency updates, or would you rather stay behind to ensure
> > compatibility?
> >
> > Coupled with that, I'd like to call for more frequent Hadoop releases for
> > the same purpose. IMHO that'll require better infrastructure to assist
> > the release work and some rethinking of our current Hadoop code
> > structure, such as separating each subproject into its own repository
> > with its own release cadence. This can be controversial, but I think
> > it'll be good for the project in the long run.
> >
> > Thanks,
> > Wei-Chiu
> >
>
