That is unfortunately true. Now that I recognize the impact of the guava update in Hadoop 3.1/3.2, how can we make this easier for downstream projects to consume? As I proposed, I think a middle ground is to shade guava in hadoop-thirdparty, and include the hadoop-thirdparty jar in the next Hadoop 3.1/3.2 releases.
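For concreteness, "shading" here means relocating guava's packages under a Hadoop-owned namespace at build time, so downstream classpaths never see the original `com.google.common` classes. A minimal sketch of how a hadoop-thirdparty module could do this with the maven-shade-plugin (the relocation pattern follows the `org.apache.hadoop.thirdparty` convention; treat the exact POM details as illustrative, not the real hadoop-thirdparty build):

```xml
<!-- Sketch of a shaded-guava module in the style of hadoop-thirdparty.
     Illustrative only: the real POM has more plugins and filters. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Downstream code imports
                 org.apache.hadoop.thirdparty.com.google.common.*,
                 so a later guava bump in Hadoop cannot clash with
                 whatever guava an application already ships. -->
            <pattern>com.google</pattern>
            <shadedPattern>org.apache.hadoop.thirdparty.com.google</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Consumers would then depend on the single shaded artifact instead of on guava itself, which is what decouples their guava version from ours.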
On Thu, Mar 12, 2020 at 12:03 AM Igor Dvorzhak <i...@google.com.invalid> wrote:

> How do you manage and version such dependency upgrades in subminor
> Hadoop/Spark/Hive versions in Cloudera then? I would imagine that some
> upgrades will be breaking for customers and cannot be shipped in a subminor
> CDH release. Or is this in preparation for the next major/minor release of
> CDH?
>
> On Wed, Mar 11, 2020 at 5:45 PM Wei-Chiu Chuang
> <weic...@cloudera.com.invalid> wrote:
>
>> FWIW we are updating guava in Spark and Hive at Cloudera. I don't know
>> which Apache versions they are going to land in, but we'll upstream them
>> for sure.
>>
>> The guava change is debatable; it's not as critical as others. There are
>> critical vulnerabilities in other dependencies that we have no choice but
>> to update to a new major/minor version, because we are so far behind. And
>> given the critical nature, I think it is worth the risk, and backporting
>> to lower maintenance releases is warranted. Moreover, our minor releases
>> are at best one per year. That is too slow to respond to a critical
>> vulnerability.
>>
>> On Wed, Mar 11, 2020 at 5:02 PM Igor Dvorzhak <i...@google.com.invalid>
>> wrote:
>>
>> > Generally I'm for updating dependencies, but I think that Hadoop should
>> > stick with semantic versioning and not make major and minor dependency
>> > updates in subminor releases.
>> >
>> > For example, Hadoop 3.2.1 updated Guava to 27.0-jre, and because of
>> > this Spark 3.0 is stuck with Hadoop 3.2.0 - they use Hive 2.3.6, which
>> > doesn't support Guava 27.0-jre.
>> >
>> > It would be better to make dependency upgrades when releasing new
>> > major/minor versions; for example, the Guava 27.0-jre upgrade was more
>> > appropriate for the Hadoop 3.3.0 release than for 3.2.1.
>> >
>> > On Tue, Mar 10, 2020 at 3:03 PM Wei-Chiu Chuang
>> > <weic...@cloudera.com.invalid> wrote:
>> >
>> >> I'm not hearing any feedback so far, but I want to suggest:
>> >>
>> >> use the hadoop-thirdparty repository to host any dependencies that are
>> >> known to break compatibility.
>> >>
>> >> Candidate #1 guava
>> >> Candidate #2 Netty
>> >> Candidate #3 Jetty
>> >>
>> >> In fact, HBase shades these dependencies for the exact same reason.
>> >>
>> >> As an example of the cost of compatibility breakage: we spent the last
>> >> 6 months backporting the guava update (guava 11 --> 27) throughout
>> >> Cloudera's stack, and after 6 months we are still not done, because we
>> >> have to update guava in Hadoop, Hive, Spark ..., and Hadoop's, Hive's,
>> >> and Spark's guava is in the classpath of every application.
>> >>
>> >> Thoughts?
>> >>
>> >> On Sat, Mar 7, 2020 at 9:31 AM Wei-Chiu Chuang <weic...@apache.org>
>> >> wrote:
>> >>
>> >> > Hi Hadoop devs,
>> >> >
>> >> > In the past, Hadoop has tended to be pretty far behind the latest
>> >> > versions of its dependencies. Part of that is due to the fear of the
>> >> > breaking changes brought in by dependency updates.
>> >> >
>> >> > However, things have changed dramatically over the past few years.
>> >> > With more focus on security, more vulnerabilities are discovered in
>> >> > our dependencies, and users put more pressure on patching Hadoop
>> >> > (and its ecosystem) to use the latest dependency versions.
>> >> >
>> >> > As an example, Jackson-databind had 20 CVEs published in the last
>> >> > year alone:
>> >> > https://www.cvedetails.com/product/42991/Fasterxml-Jackson-databind.html?vendor_id=15866
>> >> >
>> >> > Jetty: 4 CVEs in 2019:
>> >> > https://www.cvedetails.com/product/34824/Eclipse-Jetty.html?vendor_id=10410
>> >> >
>> >> > We can no longer let Hadoop stay behind.
>> >> > The more we stay behind, the harder it is to update. A good example
>> >> > is the Jersey 1 -> 2 migration,
>> >> > HADOOP-15984 <https://issues.apache.org/jira/browse/HADOOP-15984>,
>> >> > contributed by Akira. Jersey 1 is no longer supported, but the
>> >> > Jersey 2 migration is hard. If any critical vulnerability is found
>> >> > in Jersey 1, it will leave us in a bad situation, since we can't
>> >> > simply update the Jersey version and be done.
>> >> >
>> >> > Hadoop 3 adds new public artifacts that shade these dependencies. We
>> >> > should encourage downstream applications to use the public artifacts
>> >> > to avoid breakage.
>> >> >
>> >> > I'd like to hear your thoughts: are you okay with seeing Hadoop keep
>> >> > up with the latest dependency updates, or would you rather stay
>> >> > behind to ensure compatibility?
>> >> >
>> >> > Coupled with that, I'd like to call for more frequent Hadoop
>> >> > releases for the same purpose. IMHO that'll require better
>> >> > infrastructure to assist the release work, and some rethinking of
>> >> > our current Hadoop code structure, like separating each subproject
>> >> > into its own repository with its own release cadence. This can be
>> >> > controversial, but I think it'll be good for the project in the long
>> >> > run.
>> >> >
>> >> > Thanks,
>> >> > Wei-Chiu
>> >> >
>> >>
>> >
>> 