How do you manage and version such dependency upgrades in subminor Hadoop/Spark/Hive versions in Cloudera then? I would imagine that some upgrades will be breaking for customers and cannot be shipped in a subminor CDH release? Or is this in preparation for the next major/minor release of CDH?
On Wed, Mar 11, 2020 at 5:45 PM Wei-Chiu Chuang <weic...@cloudera.com.invalid> wrote:

> FWIW we are updating guava in Spark and Hive at Cloudera. I don't know which
> Apache versions they will land in, but we'll upstream them for sure.
>
> The guava change is debatable. It's not as critical as others. There are
> critical vulnerabilities in other dependencies that leave us no choice but to
> update to a new major/minor version, because we are so far behind. Given
> their critical nature, I think the risk is worth it and a backport to lower
> maintenance releases is warranted. Moreover, our minor releases come at best
> once per year. That is too slow to respond to a critical vulnerability.
>
> On Wed, Mar 11, 2020 at 5:02 PM Igor Dvorzhak <i...@google.com.invalid> wrote:
>
> > Generally I'm for updating dependencies, but I think that Hadoop should
> > stick with semantic versioning and not make major and minor dependency
> > updates in subminor releases.
> >
> > For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of this
> > Spark 3.0 is stuck with Hadoop 3.2.0 - they use Hive 2.3.6, which doesn't
> > support Guava 27.0-jre.
> >
> > It would be better to make dependency upgrades when releasing new
> > major/minor versions; for example, the Guava 27.0-jre upgrade was more
> > appropriate for the Hadoop 3.3.0 release than for 3.2.1.
> >
> > On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
> > <weic...@cloudera.com.invalid> wrote:
> >
> > > I'm not hearing any feedback so far, but I want to suggest:
> > >
> > > use the hadoop-thirdparty repository to host any dependencies that are
> > > known to break compatibility.
> > >
> > > Candidate #1: guava
> > > Candidate #2: Netty
> > > Candidate #3: Jetty
> > >
> > > In fact, HBase shades these dependencies for the exact same reason.
> > >
> > > As an example of the cost of compatibility breakage: we spent the last 6
> > > months backporting the guava update (guava 11 --> 27) throughout
> > > Cloudera's stack, and after 6 months we are not done yet, because we have
> > > to update guava in Hadoop, Hive, Spark ..., and Hadoop's, Hive's and
> > > Spark's guava is in the classpath of every application.
> > >
> > > Thoughts?
> > >
> > > On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <weic...@apache.org> wrote:
> > >
> > > > Hi Hadoop devs,
> > > >
> > > > In the past, Hadoop has tended to be pretty far behind the latest
> > > > versions of its dependencies. Part of that is due to the fear of the
> > > > breaking changes brought in by dependency updates.
> > > >
> > > > However, things have changed dramatically over the past few years. With
> > > > more focus on security vulnerabilities, more vulnerabilities are being
> > > > discovered in our dependencies, and users put more pressure on patching
> > > > Hadoop (and its ecosystem) to use the latest dependency versions.
> > > >
> > > > As an example, Jackson-databind had 20 CVEs published in the last year
> > > > alone:
> > > > https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
> > > >
> > > > Jetty: 4 CVEs in 2019:
> > > > https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
> > > >
> > > > We can no longer let Hadoop stay behind. The more we stay behind, the
> > > > harder it is to update. A good example is the Jersey 1 -> 2 migration,
> > > > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>,
> > > > contributed by Akira. Jersey 1 is no longer supported, but the Jersey 2
> > > > migration is hard. If any critical vulnerability is found in Jersey 1,
> > > > it will leave us in a bad situation, since we can't simply bump the
> > > > Jersey version and be done.
> > > >
> > > > Hadoop 3 adds new public artifacts that shade these dependencies. We
> > > > should advocate that downstream applications use the public artifacts
> > > > to avoid breakage.
> > > >
> > > > I'd like to hear your thoughts: are you okay with Hadoop keeping up
> > > > with the latest dependency updates, or would you rather stay behind to
> > > > ensure compatibility?
> > > >
> > > > Coupled with that, I'd like to call for more frequent Hadoop releases
> > > > for the same purpose. IMHO that will require better infrastructure to
> > > > assist the release work and some rethinking of our current Hadoop code
> > > > structure, like separating each subproject into its own repository and
> > > > release cadence. This can be controversial, but I think it'll be good
> > > > for the project in the long run.
> > > >
> > > > Thanks,
> > > > Wei-Chiu
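
To make the guava 11 --> 27 pain quoted above concrete, here is a rough Java sketch of the kind of source-incompatible changes involved. The two APIs shown (Objects.toStringHelper moving to MoreObjects, and Futures.addCallback growing a required Executor parameter) are picked as illustrations of removed or changed Guava methods, not as a catalogue of everything that breaks:

import com.google.common.base.MoreObjects;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

public class GuavaUpgradeExample {

  // Guava-11-era code called Objects.toStringHelper(this); that method was
  // later removed, so callers must switch to MoreObjects.toStringHelper.
  @Override
  public String toString() {
    return MoreObjects.toStringHelper(this).add("field", 42).toString();
  }

  // The two-argument Futures.addCallback(future, callback) overload was
  // removed; recent Guava requires an explicit Executor argument.
  static void watch(ListenableFuture<String> future) {
    Futures.addCallback(future, new FutureCallback<String>() {
      @Override public void onSuccess(String result) { /* handle result */ }
      @Override public void onFailure(Throwable t) { /* handle failure */ }
    }, MoreExecutors.directExecutor());
  }
}

Every project that compiled against the old signatures has to change at the same time, which is why a single guava bump ripples through Hadoop, Hive, Spark and every application that shares their classpath.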
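And a minimal sketch of what the hadoop-thirdparty / shading approach looks like from the code side. It assumes the relocated package prefix org.apache.hadoop.thirdparty.* that the shaded guava artifact uses (worth double-checking against the published artifact); Hadoop-internal code imports the relocated classes, so whatever guava version Hive, Spark or the application itself puts on the classpath no longer conflicts:

// Unrelocated guava (leaks Hadoop's guava choice onto every downstream
// classpath):
//   import com.google.common.base.Preconditions;
//
// Relocated guava from hadoop-thirdparty (private to Hadoop; package name
// assumed from the thirdparty relocation convention):
import org.apache.hadoop.thirdparty.com.google.common.base.Preconditions;
import org.apache.hadoop.thirdparty.com.google.common.collect.ImmutableList;

import java.util.List;

public class ShadedGuavaExample {
  // Typical Hadoop-internal usage: same guava API, relocated package.
  public static ImmutableList<String> firstN(List<String> items, int n) {
    Preconditions.checkArgument(n >= 0, "n must be non-negative");
    return ImmutableList.copyOf(items.subList(0, Math.min(n, items.size())));
  }
}

The cost is a one-time rewrite of imports and a re-shade on each guava update, but after that a guava CVE can be handled inside hadoop-thirdparty without forcing Hive, Spark and every application to upgrade in lockstep.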