@itai, I think that, pending the outcome of this discussion, it makes sense to have a wider community thread to announce any decisions we make here. Thanks for bringing that up.

@rajiv, Minio support seems unrelated to this discussion. It seems like a reasonable request, but I recommend starting another thread to see if someone is interested in taking up this effort.

@jihoon, I definitely agree that Hadoop should be refactored into an extension longer term. I don't think this upgrade would make such a refactor any easier, but it wouldn't make it harder either. Unfortunately, just moving Hadoop to an extension doesn't do anything to help our dependency problem, which is the thing that has agitated me enough to start this thread and start looking into solutions.

@will/@frank, I feel like the stranglehold Hadoop has on our dependencies has become especially painful in the last couple of years. Most painful to me is that we are stuck using a version of Apache Calcite from 2019 (six versions behind the latest), because newer versions require a newer version of Guava. This means we cannot get any bug fixes or improvements in our SQL parsing layer without doing something like packaging a shaded version of it ourselves or solving our Hadoop dependency problem.

Many other dependencies have also proved problematic with Hadoop in the past, and since we aren't able to run the Hadoop integration tests in Travis, there is always the chance that we don't catch these conflicts when they go in. Now that we have turned on dependabot this week (https://github.com/apache/druid/pull/11079), I imagine we are going to have to proceed very carefully with it until we are able to resolve this dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java version newer than Java 8, per https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions, which ties into another goal we have been working towards: official Druid support for Java 11+ environments.
I'm sort of at a loss as to what else to do besides one of:

- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom package our own Hadoop 2.x dependencies that are shaded similarly to the Hadoop 3 client jars, and only supporting Hadoop with application classpath isolation (mapreduce.job.classloader = true)
- just dropping support for Hadoop completely

I would much rather devote all effort to making Druid's native batch ingestion better, to encourage people to migrate to that, than continue to fight with figuring out how to keep supporting Hadoop, so upgrading and switching to the shaded client jars at least seemed like a reasonable compromise compared to dropping it completely. Maybe making custom shaded Hadoop dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard as I am imagining, but it does seem like the most work of the solutions I could think of to potentially resolve this problem.

Does anyone have any other ideas for how we can isolate our dependencies from Hadoop? Solutions like shading Guava (https://github.com/apache/druid/pull/10964) would let Druid itself use a newer Guava, but that doesn't help with conflicts within our dependencies, which have always seemed to be the larger problem to me. Moving Hadoop support to an extension doesn't help anything unless we can ensure that we can run Druid ingestion tasks on Hadoop without having to match all of the Hadoop cluster's dependencies, through some sort of classloader wizardry.

Maybe we could consider keeping a 0.22.x release line in Druid that gets security and minor bug fixes for some period of time, to give people a longer period to migrate off of Hadoop 2.x?
I can't speak for the rest of the committers, but I would personally be more open to maintaining such a branch if it meant that, moving forward, we could at least update all of our dependencies to newer versions, while providing a transition path that keeps at least some support until people migrate to Hadoop 3 or native Druid batch ingestion.

Any other ideas?

On Tue, Jun 8, 2021 at 7:44 PM frank chen <frankc...@apache.org> wrote:

> Considering Druid takes advantage of lots of external components to work,
> I think we should upgrade Druid in a little bit conservative way. Dropping
> support of hadoop2 is not a good idea.
> The upgrading of the ZooKeeper client in Druid also prevents me from
> adopting 0.22 for a longer time.
>
> Although users could upgrade these dependencies first to use the latest
> Druid releases, frankly speaking, these upgrades are not so easy in
> production and usually take longer time, which would prevent users from
> experiencing new features of Druid.
> For hadoop3, I have heard of some performance issues, which also makes me
> have no confidence to upgrade.
>
> I think what Jihoon proposes is a good idea, separating hadoop2 from Druid
> core as an extension.
> Since hadoop2 has not been EOL, to achieve balance between compatibility
> and long term evolution, maybe we could provide two extensions, one for
> hadoop2, one for hadoop3.
>
> On Wed, Jun 9, 2021 at 4:13 AM Will Lauer <wla...@verizonmedia.com.invalid> wrote:
>
> > Just to follow up on this, our main problem with hadoop3 right now has
> > been instability in HDFS, to the extent that we put on hold any plans to
> > deploy it to our production systems. I would claim Hadoop3 isn't mature
> > enough yet to consider migrating Druid to it.
> >
> > Will
> >
> > <http://www.verizonmedia.com>
> >
> > Will Lauer
> > Senior Principal Architect, Audience & Advertising Reporting
> > Data Platforms & Systems Engineering
> >
> > M 508 561 6427
> > 1908 S. First St
> > Champaign, IL 61822
> >
> > <http://www.facebook.com/verizonmedia> <http://twitter.com/verizonmedia>
> > <https://www.linkedin.com/company/verizon-media/>
> > <http://www.instagram.com/verizonmedia>
> >
> > On Tue, Jun 8, 2021 at 2:59 PM Will Lauer <wla...@verizonmedia.com> wrote:
> >
> > > Unfortunately, the migration off of hadoop3 is a hard one (maybe not
> > > for Druid, but certainly for big organizations running large hadoop2
> > > workloads). If druid migrated to hadoop3 after 0.22, that would
> > > probably prevent me from taking any new versions of Druid for at least
> > > the remainder of the year and possibly longer.
> > >
> > > Will
> > >
> > > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie <cwy...@apache.org> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I've been assisting with some experiments to see how we might want to
> > >> migrate Druid to support Hadoop 3.x, and more importantly, see if
> > >> maybe we can finally be free of some of the dependency issues it has
> > >> been causing for as long as I can remember working with Druid.
> > >>
> > >> Hadoop 3 introduced shaded client jars,
> > >> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose
> > >> to allow applications to talk to the Hadoop cluster without drowning
> > >> in its transitive dependencies. The experimental branch that I have
> > >> been helping with, which is using these new shaded client jars, can be
> > >> seen in this PR, https://github.com/apache/druid/pull/11314, and is
> > >> currently working with the HDFS integration tests as well as the
> > >> Hadoop tutorial flow in the Druid docs (which is pretty much
> > >> equivalent to the HDFS integration test).
> > >>
> > >> The cloud deep storages still need some further testing and some minor
> > >> cleanup still needs done for the docs and such. Additionally we still
> > >> need to figure out how to handle the Kerberos extension, because it
> > >> extends some Hadoop classes so isn't able to use the shaded client
> > >> jars in a straight-forward manner, and so still has heavy dependencies
> > >> and hasn't been tested. However, the experiment has started to pan out
> > >> enough to where I think it is worth starting this discussion, because
> > >> it does have some implications.
> > >>
> > >> Making this change I think will allow us to update our dependencies
> > >> with a lot more freedom (I'm looking at you, Guava), but the catch is
> > >> that once we make this change and start updating these dependencies,
> > >> it will become hard, nearing impossible to support Hadoop 2.x, since
> > >> as far as I know there isn't an equivalent set of shaded client jars.
> > >> I am also not certain how far back the Hadoop job classpath isolation
> > >> stuff goes (mapreduce.job.classloader = true) which I think is
> > >> required to be set on Druid tasks for this shaded stuff to work
> > >> alongside updated Druid dependencies.
> > >>
> > >> Is anyone opposed to or worried about dropping Hadoop 2.x support
> > >> after the Druid 0.22 release?
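
For anyone who wants to experiment with the job classpath isolation mentioned above (mapreduce.job.classloader = true), here is a minimal sketch in Python of adding that property to a Hadoop ingestion task spec through the jobProperties map of the tuningConfig. The helper name with_classloader_isolation and the trimmed-down example spec are made up for illustration, so treat this as a starting point under those assumptions rather than a tested recipe.

```python
import copy
import json


def with_classloader_isolation(task_spec):
    """Return a copy of a Hadoop ingestion task spec with job classloader isolation on.

    Hypothetical helper for illustration; it only touches
    spec.tuningConfig.jobProperties and leaves the input untouched.
    """
    patched = copy.deepcopy(task_spec)
    tuning = patched.setdefault("spec", {}).setdefault("tuningConfig", {"type": "hadoop"})
    job_props = tuning.setdefault("jobProperties", {})
    # Hadoop job properties are passed as strings, so "true" rather than True.
    job_props["mapreduce.job.classloader"] = "true"
    return patched


# Trimmed example spec (assumption): a real spec also needs dataSchema and ioConfig.
example_task = {
    "type": "index_hadoop",
    "spec": {
        "tuningConfig": {
            "type": "hadoop",
            "jobProperties": {"mapreduce.map.memory.mb": "2048"},
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(with_classloader_isolation(example_task), indent=2))
```

Existing job properties are kept as they are, so the helper can be applied to a spec that already carries cluster-specific tuning.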