Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Clint Wylie
@itai, I think that, pending the outcome of this discussion, it makes sense
to have a wider community thread to announce any decisions we make here.
Thanks for bringing that up.

@rajiv, Minio support seems unrelated to this discussion. It seems like a
reasonable request, but I recommend starting another thread to see if
someone is interested in taking up this effort.

@jihoon I definitely agree that Hadoop should be refactored into an
extension longer term. I don't think this upgrade would necessarily
make such a refactor any easier, but it wouldn't make it harder either. Just
moving Hadoop to an extension unfortunately doesn't really do anything to
help our dependency problem though, which is the thing that has agitated me
enough to start this thread and start looking into solutions.

@will/@frank I feel like the stranglehold Hadoop has on our dependencies
has become especially painful over the last couple of years. Most painful
to me is that we are stuck using a version of Apache Calcite from 2019 (six
versions behind the latest), because newer versions require a newer version
of Guava. This means we cannot get any bug fixes or improvements in our SQL
parsing layer without doing something like packaging a shaded version of it
ourselves or solving our Hadoop dependency problem.

Many other dependencies have also proved problematic with Hadoop in the
past, and since we aren't able to run the Hadoop integration tests in
Travis, there is always a chance that we don't catch these issues when they
go in. Now that we have turned on dependabot this week
(https://github.com/apache/druid/pull/11079), I imagine we are going to
have to proceed very carefully with it until we are able to resolve this
dependency issue.

Hadoop 3.3.0 is also the first release to support running on a Java version
newer than Java 8, per
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions,
which is another area we have been working towards: official Druid support
for Java 11+ environments.

I'm sort of at a loss for what else to do besides one of the following:
- switching to these Hadoop 3 shaded jars and dropping 2.x support
- figuring out how to custom package our own Hadoop 2.x dependencies
that are shaded similarly to the Hadoop 3 client jars, and only supporting
Hadoop with application classpath isolation (mapreduce.job.classloader =
true; see the sketch after this list)
- just dropping support for Hadoop completely
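
To make that classpath isolation option concrete, here is a rough sketch of
how the flag could be passed through the jobProperties of a Druid Hadoop
ingestion spec. The dataSchema and ioConfig sections are elided, and the
exact spec shape should be checked against the Druid docs for the version in
use:

{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": { ... },
    "ioConfig": { ... },
    "tuningConfig": {
      "type": "hadoop",
      "jobProperties": {
        "mapreduce.job.classloader": "true"
      }
    }
  }
}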

I would much rather devote all effort to making Druid's native batch
ingestion better, to encourage people to migrate to it, than continue
fighting to figure out how to keep supporting Hadoop, so upgrading and
switching to the shaded client jars at least seemed like a reasonable
compromise compared to dropping it completely. Maybe making custom shaded
Hadoop dependencies in the spirit of the Hadoop 3 shaded jars isn't as hard
as I am imagining, but it does seem like the most work among the solutions
I could think of to potentially resolve this problem.

Does anyone have any other ideas for how we can isolate our dependencies
from Hadoop? Solutions like shading Guava
(https://github.com/apache/druid/pull/10964) would let Druid itself use
newer Guava, but that doesn't help conflicts within our dependencies, which
have always seemed like the larger problem to me. Moving Hadoop support to
an extension doesn't help anything unless we can ensure that we can run
Druid ingestion tasks on Hadoop without having to match all of the Hadoop
cluster's dependencies with some sort of classloader wizardry.
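
For anyone less familiar with what "shading Guava" involves, below is a
minimal sketch of a maven-shade-plugin relocation; the shaded package name
is hypothetical and not necessarily what the linked PR uses:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Relocate Guava classes into a Druid-private package so they
               cannot clash with whatever Guava version Hadoop brings in.
               The target package below is hypothetical. -->
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.druid.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>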

Maybe we could consider keeping a 0.22.x release line in Druid that gets
security and minor bug fixes for some period of time, to give people a
longer window to migrate off of Hadoop 2.x? I can't speak for the rest of
the committers, but I would personally be more open to maintaining such a
branch if it meant that, moving forward, we could at least update all of our
dependencies to newer versions, while providing a transition path with some
level of support until people migrate to Hadoop 3 or native Druid batch
ingestion.

Any other ideas?



On Tue, Jun 8, 2021 at 7:44 PM frank chen  wrote:

> Considering Druid takes advantage of lots of external components to work, I
> think we should upgrade Druid in a somewhat conservative way. Dropping
> support for hadoop2 is not a good idea.
> The upgrading of the ZooKeeper client in Druid also prevents me from
> adopting 0.22 for a longer time.
>
> Although users could upgrade these dependencies first to use the latest
> Druid releases, frankly speaking, these upgrades are not so easy in
> production and usually take a long time, which would prevent users from
> experiencing new features of Druid.
> For hadoop3, I have heard of some performance issues, which also leaves me
> with little confidence to upgrade.
>
> I think what Jihoon proposes is a good idea, separating hadoop2 from Druid
> core as an extension.
> Since hadoop2 has not reached EOL, to achieve balance between compatibility

Re: [VOTE] Release Apache Druid 0.21.1 [RC2]

2021-06-08 Thread Clint Wylie
This vote has passed; the final results can be seen in this thread:
https://lists.apache.org/thread.html/r3c7db826cdf9025efa2f3906e4f0d2ff69b66ae4e3513212a0bad2e3%40%3Cdev.druid.apache.org%3E

On Tue, Jun 8, 2021 at 5:38 PM Jonathan Wei  wrote:

> +1 (binding)
> src
> - verified signature/checksum
> - LICENSE/NOTICE present
> - ran RAT check
> - ran unit tests
> - built binary and ran ingestion tutorial and a few queries
>
> bin
> - verified signature/checksum
> - LICENSE/NOTICE present
> - ran ingestion tutorial and a few queries
>
> docker
> - built docker image from source on linux
> - ran docker-compose quickstart cluster on linux, ran ingestion tutorial
> and a few queries
>
> On Sat, Jun 5, 2021 at 2:01 PM Jihoon Son  wrote:
>
> > +1 (binding)
> >
> > src
> > - verified the signature and checksum
> > - LICENSE and NOTICE are present
> > - compiled and ran the license check and unit tests
> > - built binary, ingested some data via batch and kafka ingestion, and
> > ran some queries
> >
> > bin
> > - verified the signature and checksum
> > - LICENSE and NOTICE are present
> > - ingested some data via batch and kafka ingestion and ran some queries
> >
> > docker
> > - verified checksum
> > - ran a cluster after cleaning up existing volumes on linux
> > - ingested some data via batch ingestion and ran some queries
> >
> > On Thu, Jun 3, 2021 at 10:43 PM frank chen  wrote:
> > >
> > > +1
> > >
> > > src
> > > - verified .asc signatures and .sha512 checksums
> > > - LICENSE/NOTICE present
> > > - compiled, ran the license check/unit tests
> > > - started nano-quickstart cluster and ran both native ingestion and
> Kafka
> > > ingestion followed by some basic queries
> > >
> > > bin
> > > - verified .asc signatures and .sha512 checksums
> > > - LICENSE/NOTICE present
> > > - started nano-quickstart cluster and ran both native ingestion and
> Kafka
> > > ingestion
> > >
> > > docker
> > > - started docker on Linux(CentOS) and ran both native ingestion and
> Kafka
> > > ingestion followed by some basic queries
> > > - started docker on macOS and ran both native ingestion and Kafka
> > ingestion
> > > followed by some basic queries
> > >
> > >
> > > Clint Wylie wrote on Thu, Jun 3, 2021 at 5:00 PM:
> > >
> > > > Hi all,
> > > >
> > > > I have created a build for Apache Druid 0.21.1, release
> > > > candidate 2.
> > > >
> > > > Thanks to everyone who has helped contribute to the release! You can
> > read
> > > > the proposed release notes here:
> > > > https://github.com/apache/druid/issues/11249
> > > >
> > > > The release candidate has been tagged in GitHub as
> > > > druid-0.21.1-rc2 (6ba6b16786eca5a25ebad2f00df4b2a265861b01),
> > > > available here:
> > > > https://github.com/apache/druid/releases/tag/druid-0.21.1-rc2
> > > >
> > > > The artifacts to be voted on are located here:
> > > > https://dist.apache.org/repos/dist/dev/druid/0.21.1-rc2/
> > > >
> > > > A staged Maven repository is available for review at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachedruid-1025/
> > > >
> > > > Staged druid.apache.org website documentation is available here:
> > > > https://druid.staged.apache.org/docs/0.21.1/design/index.html
> > > >
> > > > A Docker image containing the binary of the release candidate can be
> > > > retrieved via:
> > > > docker pull apache/druid:0.21.1-rc2
> > > >
> > > > artifact checksums
> > > > src:
> > > >
> > > >
> >
> 65eff0c302c316afbf4a84c61a9f54a1baf5ff3cd1baf390d9034665f7e4fbc7457e108c8c5a4ac66350ee69d922e5bcdacde14abac4b54398378b99022acd16
> > > > bin:
> > > >
> > > >
> >
> a68c63ddf92a0939315bd8b79fbdd5fae712e888d1a0e0466f0a87a8e8f1270d70fc5dcc0ec64e076391576be29da1cce88f7c5cf725c31418601e4eee8fa354
> > > > docker:
> > a3a301693db0eea5af1278535586e7f9768bd497231d6748f35a983962780371
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/cwylie.asc
> > > >
> > > > This key and the key of other committers can also be found in the
> > project's
> > > > KEYS file here:
> > > > https://dist.apache.org/repos/dist/release/druid/KEYS
> > > >
> > > > (If you are a committer, please feel free to add your own key to that
> > file
> > > > by following the instructions in the file's header.)
> > > >
> > > >
> > > > Verify checksums:
> > > > diff <(shasum -a512 apache-druid-0.21.1-src.tar.gz | \
> > > > cut -d ' ' -f1) \
> > > > <(cat apache-druid-0.21.1-src.tar.gz.sha512 ; echo)
> > > >
> > > > diff <(shasum -a512 apache-druid-0.21.1-bin.tar.gz | \
> > > > cut -d ' ' -f1) \
> > > > <(cat apache-druid-0.21.1-bin.tar.gz.sha512 ; echo)
> > > >
> > > > Verify signatures:
> > > > gpg --verify apache-druid-0.21.1-src.tar.gz.asc \
> > > > apache-druid-0.21.1-src.tar.gz
> > > >
> > > > gpg --verify apache-druid-0.21.1-bin.tar.gz.asc \
> > > > apache-druid-0.21.1-bin.tar.gz
> > > >
> > > > Please review the proposed artifacts and vote. Note that Apache has
> > > > specific requirements that must be met before +1 binding 

[RESULT] [VOTE] Release Apache Druid 0.21.1 [RC2]

2021-06-08 Thread Clint Wylie
Thanks to everyone who participated in the vote! The vote has passed with 3
binding +1s and 1 non-binding +1.

Clint Wylie: +1 (binding)
Frank Chen: +1 (non-binding)
Jihoon Son: +1 (binding)
Jon Wei: +1 (binding)


Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread frank chen
Considering Druid takes advantage of lots of external components to work, I
think we should upgrade Druid in a somewhat conservative way. Dropping
support for hadoop2 is not a good idea.
The upgrading of the ZooKeeper client in Druid also prevents me from
adopting 0.22 for a longer time.

Although users could upgrade these dependencies first to use the latest
Druid releases, frankly speaking, these upgrades are not so easy in
production and usually take a long time, which would prevent users from
experiencing new features of Druid.
For hadoop3, I have heard of some performance issues, which also leaves me
with little confidence to upgrade.

I think what Jihoon proposes is a good idea, separating hadoop2 from Druid
core as an extension.
Since hadoop2 has not reached EOL, to achieve a balance between compatibility
and long-term evolution, maybe we could provide two extensions, one for
hadoop2 and one for hadoop3.



Will Lauer wrote on Wed, Jun 9, 2021 at 4:13 AM:

> Just to follow up on this, our main problem with hadoop3 right now has been
> instability in HDFS, to the extent that we put on hold any plans to deploy
> it to our production systems. I would claim Hadoop3 isn't mature enough yet
> to consider migrating Druid to it.
>
> Will
>
> 
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
>    
> 
> 
>
>
>
> On Tue, Jun 8, 2021 at 2:59 PM Will Lauer  wrote:
>
> > Unfortunately, the migration to hadoop3 is a hard one (maybe not for
> > Druid, but certainly for big organizations running large hadoop2
> > workloads). If Druid migrated to hadoop3 after 0.22, that would probably
> > prevent me from taking any new versions of Druid for at least the
> > remainder of the year and possibly longer.
> >
> > Will
> >
> >
> > 
> >
> > Will Lauer
> >
> > Senior Principal Architect, Audience & Advertising Reporting
> > Data Platforms & Systems Engineering
> >
> > M 508 561 6427
> > 1908 S. First St
> > Champaign, IL 61822
> >
> >
> >
> >
> >
> > On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie  wrote:
> >
> >> Hi all,
> >>
> >> I've been assisting with some experiments to see how we might want to
> >> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe
> we
> >> can finally be free of some of the dependency issues it has been causing
> >> for as long as I can remember working with Druid.
> >>
> >> Hadoop 3 introduced shaded client jars,
> >> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
> >> allowing applications to talk to the Hadoop cluster without drowning in its
> >> transitive dependencies. The experimental branch that I have been helping
> >> with, which is using these new shaded client jars, can be seen in this PR
> >> https://github.com/apache/druid/pull/11314, and is currently working with
> >> the HDFS integration tests as well as the Hadoop tutorial flow in the
> >> Druid
> >> docs (which is pretty much equivalent to the HDFS integration test).
> >>
> >> The cloud deep storages still need some further testing, and some minor
> >> cleanup still needs to be done for the docs and such. Additionally we
> >> still need to figure out how to handle the Kerberos extension, because it
> >> extends some Hadoop classes so isn't able to use the shaded client jars in
> >> a straightforward manner, and so still has heavy dependencies and hasn't
> >> been tested. However, the experiment has started to pan out enough that I
> >> think it is worth starting this discussion, because it does have some
> >> implications.
> >>
> >> Making this change I think will allow us to update our dependencies
> with a
> >> lot more freedom (I'm looking at you, Guava), but the catch is that once
> >> we
> >> make this change and start updating these dependencies, it will become
> >> hard, nearly impossible, to support Hadoop 2.x, since as far as I know
> >> there isn't an equivalent set of shaded client jars. I am also not
> certain
> >> how far back the Hadoop job classpath isolation stuff goes
> >> (mapreduce.job.classloader = true) which I think is 

Re: [VOTE] Release Apache Druid 0.21.1 [RC2]

2021-06-08 Thread Jonathan Wei
+1 (binding)
src
- verified signature/checksum
- LICENSE/NOTICE present
- ran RAT check
- ran unit tests
- built binary and ran ingestion tutorial and a few queries

bin
- verified signature/checksum
- LICENSE/NOTICE present
- ran ingestion tutorial and a few queries

docker
- built docker image from source on linux
- ran docker-compose quickstart cluster on linux, ran ingestion tutorial
and a few queries

On Sat, Jun 5, 2021 at 2:01 PM Jihoon Son  wrote:

> +1 (binding)
>
> src
> - verified the signature and checksum
> - LICENSE and NOTICE are present
> - compiled and ran the license check and unit tests
> - built binary, ingested some data via batch and kafka ingestion, and
> ran some queries
>
> bin
> - verified the signature and checksum
> - LICENSE and NOTICE are present
> - ingested some data via batch and kafka ingestion and ran some queries
>
> docker
> - verified checksum
> - ran a cluster after cleaning up existing volumes on linux
> - ingested some data via batch ingestion and ran some queries
>
> On Thu, Jun 3, 2021 at 10:43 PM frank chen  wrote:
> >
> > +1
> >
> > src
> > - verified .asc signatures and .sha512 checksums
> > - LICENSE/NOTICE present
> > - compiled, ran the license check/unit tests
> > - started nano-quickstart cluster and ran both native ingestion and Kafka
> > ingestion followed by some basic queries
> >
> > bin
> > - verified .asc signatures and .sha512 checksums
> > - LICENSE/NOTICE present
> > - started nano-quickstart cluster and ran both native ingestion and Kafka
> > ingestion
> >
> > docker
> > - started docker on Linux(CentOS) and ran both native ingestion and Kafka
> > ingestion followed by some basic queries
> > - started docker on macOS and ran both native ingestion and Kafka
> ingestion
> > followed by some basic queries
> >
> >
> > Clint Wylie wrote on Thu, Jun 3, 2021 at 5:00 PM:
> >
> > > Hi all,
> > >
> > > I have created a build for Apache Druid 0.21.1, release
> > > candidate 2.
> > >
> > > Thanks to everyone who has helped contribute to the release! You can
> read
> > > the proposed release notes here:
> > > https://github.com/apache/druid/issues/11249
> > >
> > > The release candidate has been tagged in GitHub as
> > > druid-0.21.1-rc2 (6ba6b16786eca5a25ebad2f00df4b2a265861b01),
> > > available here:
> > > https://github.com/apache/druid/releases/tag/druid-0.21.1-rc2
> > >
> > > The artifacts to be voted on are located here:
> > > https://dist.apache.org/repos/dist/dev/druid/0.21.1-rc2/
> > >
> > > A staged Maven repository is available for review at:
> > >
> https://repository.apache.org/content/repositories/orgapachedruid-1025/
> > >
> > > Staged druid.apache.org website documentation is available here:
> > > https://druid.staged.apache.org/docs/0.21.1/design/index.html
> > >
> > > A Docker image containing the binary of the release candidate can be
> > > retrieved via:
> > > docker pull apache/druid:0.21.1-rc2
> > >
> > > artifact checksums
> > > src:
> > >
> > >
> 65eff0c302c316afbf4a84c61a9f54a1baf5ff3cd1baf390d9034665f7e4fbc7457e108c8c5a4ac66350ee69d922e5bcdacde14abac4b54398378b99022acd16
> > > bin:
> > >
> > >
> a68c63ddf92a0939315bd8b79fbdd5fae712e888d1a0e0466f0a87a8e8f1270d70fc5dcc0ec64e076391576be29da1cce88f7c5cf725c31418601e4eee8fa354
> > > docker:
> a3a301693db0eea5af1278535586e7f9768bd497231d6748f35a983962780371
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/cwylie.asc
> > >
> > > This key and the key of other committers can also be found in the
> project's
> > > KEYS file here:
> > > https://dist.apache.org/repos/dist/release/druid/KEYS
> > >
> > > (If you are a committer, please feel free to add your own key to that
> file
> > > by following the instructions in the file's header.)
> > >
> > >
> > > Verify checksums:
> > > diff <(shasum -a512 apache-druid-0.21.1-src.tar.gz | \
> > > cut -d ' ' -f1) \
> > > <(cat apache-druid-0.21.1-src.tar.gz.sha512 ; echo)
> > >
> > > diff <(shasum -a512 apache-druid-0.21.1-bin.tar.gz | \
> > > cut -d ' ' -f1) \
> > > <(cat apache-druid-0.21.1-bin.tar.gz.sha512 ; echo)
> > >
> > > Verify signatures:
> > > gpg --verify apache-druid-0.21.1-src.tar.gz.asc \
> > > apache-druid-0.21.1-src.tar.gz
> > >
> > > gpg --verify apache-druid-0.21.1-bin.tar.gz.asc \
> > > apache-druid-0.21.1-bin.tar.gz
> > >
> > > Please review the proposed artifacts and vote. Note that Apache has
> > > specific requirements that must be met before +1 binding votes can be
> cast
> > > by PMC members. Please refer to the policy at
> > > http://www.apache.org/legal/release-policy.html#policy for more
> details.
> > >
> > > As part of the validation process, the release artifacts can be
> generated
> > > from source by running:
> > > mvn clean install -Papache-release,dist -Dgpg.skip
> > >
> > > The RAT license check can be run from source by:
> > > mvn apache-rat:check -Prat
> > >
> > > This vote will be open for at least 72 hours. The vote will pass if a
> > > majority of at least 

Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Will Lauer
Just to follow up on this, our main problem with hadoop3 right now has been
instability in HDFS, to the extent that we put on hold any plans to deploy
it to our production systems. I would claim Hadoop3 isn't mature enough yet
to consider migrating Druid to it.

Will



Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822

   





On Tue, Jun 8, 2021 at 2:59 PM Will Lauer  wrote:

> Unfortunately, the migration to hadoop3 is a hard one (maybe not for
> Druid, but certainly for big organizations running large hadoop2
> workloads). If Druid migrated to hadoop3 after 0.22, that would probably
> prevent me from taking any new versions of Druid for at least the remainder
> of the year and possibly longer.
>
> Will
>
>
> 
>
> Will Lauer
>
> Senior Principal Architect, Audience & Advertising Reporting
> Data Platforms & Systems Engineering
>
> M 508 561 6427
> 1908 S. First St
> Champaign, IL 61822
>
>    
>
> 
>
>
>
> On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie  wrote:
>
>> Hi all,
>>
>> I've been assisting with some experiments to see how we might want to
>> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
>> can finally be free of some of the dependency issues it has been causing
>> for as long as I can remember working with Druid.
>>
>> Hadoop 3 introduced shaded client jars,
>> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
>> allowing applications to talk to the Hadoop cluster without drowning in its
>> transitive dependencies. The experimental branch that I have been helping
>> with, which is using these new shaded client jars, can be seen in this PR
>> https://github.com/apache/druid/pull/11314, and is currently working with
>> the HDFS integration tests as well as the Hadoop tutorial flow in the
>> Druid
>> docs (which is pretty much equivalent to the HDFS integration test).
>>
>> The cloud deep storages still need some further testing, and some minor
>> cleanup still needs to be done for the docs and such. Additionally we still
>> need to figure out how to handle the Kerberos extension, because it extends
>> some Hadoop classes so isn't able to use the shaded client jars in a
>> straightforward manner, and so still has heavy dependencies and hasn't
>> been tested. However, the experiment has started to pan out enough that I
>> think it is worth starting this discussion, because it does have some
>> implications.
>>
>> Making this change I think will allow us to update our dependencies with a
>> lot more freedom (I'm looking at you, Guava), but the catch is that once
>> we
>> make this change and start updating these dependencies, it will become
>> hard, nearly impossible, to support Hadoop 2.x, since as far as I know
>> there isn't an equivalent set of shaded client jars. I am also not certain
>> how far back the Hadoop job classpath isolation stuff goes
>> (mapreduce.job.classloader = true) which I think is required to be set on
>> Druid tasks for this shaded stuff to work alongside updated Druid
>> dependencies.
>>
>> Is anyone opposed to or worried about dropping Hadoop 2.x support after
>> the
>> Druid 0.22 release?
>>
>


Re: [E] [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Will Lauer
Unfortunately, the migration to hadoop3 is a hard one (maybe not for
Druid, but certainly for big organizations running large hadoop2
workloads). If Druid migrated to hadoop3 after 0.22, that would probably
prevent me from taking any new versions of Druid for at least the remainder
of the year and possibly longer.

Will




Will Lauer

Senior Principal Architect, Audience & Advertising Reporting
Data Platforms & Systems Engineering

M 508 561 6427
1908 S. First St
Champaign, IL 61822

   





On Tue, Jun 8, 2021 at 3:08 AM Clint Wylie  wrote:

> Hi all,
>
> I've been assisting with some experiments to see how we might want to
> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> can finally be free of some of the dependency issues it has been causing
> for as long as I can remember working with Druid.
>
> Hadoop 3 introduced shaded client jars,
> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
> allowing applications to talk to the Hadoop cluster without drowning in its
> transitive dependencies. The experimental branch that I have been helping
> with, which is using these new shaded client jars, can be seen in this PR
> https://github.com/apache/druid/pull/11314, and is currently working with
> the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> docs (which is pretty much equivalent to the HDFS integration test).
>
> The cloud deep storages still need some further testing, and some minor
> cleanup still needs to be done for the docs and such. Additionally we still
> need to figure out how to handle the Kerberos extension, because it extends
> some Hadoop classes so isn't able to use the shaded client jars in a
> straightforward manner, and so still has heavy dependencies and hasn't been
> tested. However, the experiment has started to pan out enough that I think
> it is worth starting this discussion, because it does have some
> implications.
>
> Making this change I think will allow us to update our dependencies with a
> lot more freedom (I'm looking at you, Guava), but the catch is that once we
> make this change and start updating these dependencies, it will become
> hard, nearly impossible, to support Hadoop 2.x, since as far as I know
> there isn't an equivalent set of shaded client jars. I am also not certain
> how far back the Hadoop job classpath isolation stuff goes
> (mapreduce.job.classloader = true) which I think is required to be set on
> Druid tasks for this shaded stuff to work alongside updated Druid
> dependencies.
>
> Is anyone opposed to or worried about dropping Hadoop 2.x support after the
> Druid 0.22 release?
>


Re: Enabling dependabot in our github repository

2021-06-08 Thread Julian Hyde
I agree that PRs should not be committed immediately and unconditionally when 
the dependabot finds them. But if we defer, there is a concern that good PRs 
will be forgotten. How about making a particular person (say the release 
manager) or triggering event (say voting on an RC) responsible for checking that all 
applicable PRs have been applied?

> On Jun 8, 2021, at 6:58 AM, Gian Merlino  wrote:
> 
> Here's a running list of PRs opened by the dependabot:
> https://github.com/apache/druid/pulls?q=is%3Apr+author%3Aapp%2Fdependabot
> 
> On Mon, Jun 7, 2021 at 12:22 PM Gian Merlino  wrote:
> 
>> There's been some extra discussion on this PR:
>> https://github.com/apache/druid/pull/11079
>> 
>> I just +1'ed it, but I wanted to come back here to say that IMO, we should
>> avoid getting in the habit of blindly applying these updates without
>> testing. There have been lots of situations in the past where a
>> harmless-looking dependency upgrade broke something. Sometimes the new
>> dependency version had a regression in it, and sometimes even without
>> regressions it can introduce compatibility problems.
>> 
>> So, I think it'd be good to apply the updates when we're confident in our
>> ability to test them, and add ignores (or tests!) for the rest.
>> 
>> On Thu, Apr 8, 2021 at 12:35 PM Xavier Léauté 
>> wrote:
>> 
>>> Thanks Maytas, I asked in that thread. They seemed concerned about write
>>> access requested by dependabot,
>>> but that should no longer be required as far as I can tell, now that it is
>>> natively integrated into GitHub.
>>> It should only be a matter of adding the config file to the repo, similar
>>> to what we do to automate closing stale issues / PR.
>>> 
>>> On Tue, Apr 6, 2021 at 2:50 PM Maytas Monsereenusorn 
>>> wrote:
>>> 
 I remember seeing that someone asked about Dependabot in the asfinfra Slack
>>> channel
 a few weeks ago. However, asfinfra said they cannot allow it.
 Here is the link:
 https://the-asf.slack.com/archives/CBX4TSBQ8/p1616539376210800
 I think this is the same as Github's dependabot.
 
 Best Regards,
 Maytas
 
 
 On Tue, Apr 6, 2021 at 2:37 PM Xavier Léauté  wrote:
 
> Hi folks, as you know Druid has a lot of dependencies, and keeping up
 with
> the latest versions of everything, whether it relates to fixing CVEs
>>> or
> other improvements is a lot of manual work.
> 
> I suggest we enable Github's dependabot in our repository to keep our
> dependencies up to date. The bot is also helpful in providing a short
> commit log summary to understand changes.
> This might yield a flurry of PRs initially, but we can configure it to
> exclude libraries or version ranges that we know are unsafe for us to
> upgrade to.
> 
> It looks like some other ASF repos have this enabled already (see
> https://github.com/apache/commons-imaging/pull/126), so hopefully
>>> this
> only
> requires filing an INFRA ticket.
> 
> Happy to take care of it if folks are on board.
> 
> Thanks!
> Xavier
> 
 
>>> 
>> 





Re: [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Jihoon Son
Clint, thank you for starting this thread. I love the idea of dropping
support for Hadoop 2.x. The shaded jars will definitely help us
upgrade our rusty dependencies.
Another problem with hadoop is that the hadoop ingestion lives in the
Druid core today, not in a separate extension. Longer term, we should
separate it completely from the Druid core and make it an extension.
I'm wondering whether dropping support for 2.x affects that at all (does
this change make it easier or harder?). Have you thought about it?

On Tue, Jun 8, 2021 at 2:25 AM Itai Yaffe  wrote:
>
> Hey Clint,
> I think it's definitely a step in the right direction.
> One thing I would suggest, since there are several deployments using Hadoop
> (either for deep storage and/or for ingestion), is to let the wider
> community know in advance that Hadoop 2.x support is going to be dropped in
> favor of 3.x (so they have time to adjust their deployments accordingly).
> If that sort of community-wide notification has already been done and I
> missed it, please let me know.
>
> Thanks!
>   Itai
>
> On Tue, Jun 8, 2021 at 11:08 AM Clint Wylie  wrote:
>
> > Hi all,
> >
> > I've been assisting with some experiments to see how we might want to
> > migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> > can finally be free of some of the dependency issues it has been causing
> > for as long as I can remember working with Druid.
> >
> > Hadoop 3 introduced shaded client jars,
> > https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
> > allowing applications to talk to the Hadoop cluster without drowning in its
> > transitive dependencies. The experimental branch that I have been helping
> > with, which is using these new shaded client jars, can be seen in this PR
> > https://github.com/apache/druid/pull/11314, and is currently working with
> > the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> > docs (which is pretty much equivalent to the HDFS integration test).
> >
> > The cloud deep storages still need some further testing, and some minor
> > cleanup still needs to be done for the docs and such. Additionally we still
> > need to figure out how to handle the Kerberos extension, because it extends
> > some Hadoop classes so isn't able to use the shaded client jars in a
> > straightforward manner, and so still has heavy dependencies and hasn't been
> > tested. However, the experiment has started to pan out enough that I think
> > it is worth starting this discussion, because it does have some
> > implications.
> >
> > Making this change I think will allow us to update our dependencies with a
> > lot more freedom (I'm looking at you, Guava), but the catch is that once we
> > make this change and start updating these dependencies, it will become
> > hard, nearly impossible, to support Hadoop 2.x, since as far as I know
> > there isn't an equivalent set of shaded client jars. I am also not certain
> > how far back the Hadoop job classpath isolation stuff goes
> > (mapreduce.job.classloader = true) which I think is required to be set on
> > Druid tasks for this shaded stuff to work alongside updated Druid
> > dependencies.
> >
> > Is anyone opposed to or worried about dropping Hadoop 2.x support after the
> > Druid 0.22 release?
> >




Re: [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Rajiv Mordani
Also, how about officially supporting MinIO? I know that support for S3
exists, but it would be good to officially support MinIO as deep storage as
well.


- Rajiv

From: Clint Wylie 
Date: Tuesday, June 8, 2021 at 1:08 AM
To: dev@druid.apache.org 
Subject: [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x
Hi all,

I've been assisting with some experiments to see how we might want to
migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
can finally be free of some of the dependency issues it has been causing
for as long as I can remember working with Druid.

Hadoop 3 introduced shaded client jars,
https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
allowing applications to talk to the Hadoop cluster without drowning in its
transitive dependencies. The experimental branch that I have been helping
with, which is using these new shaded client jars, can be seen in this PR
https://github.com/apache/druid/pull/11314, and is currently working with
the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
docs (which is pretty much equivalent to the HDFS integration test).

The cloud deep storages still need some further testing, and some minor
cleanup still needs to be done for the docs and such. Additionally we still
need to figure out how to handle the Kerberos extension, because it extends
some Hadoop classes so isn't able to use the shaded client jars in a
straightforward manner, and so still has heavy dependencies and hasn't been
tested. However, the experiment has started to pan out enough that I think
it is worth starting this discussion, because it does have some
implications.

Making this change I think will allow us to update our dependencies with a
lot more freedom (I'm looking at you, Guava), but the catch is that once we
make this change and start updating these dependencies, it will become
hard, nearly impossible, to support Hadoop 2.x, since as far as I know
there isn't an equivalent set of shaded client jars. I am also not certain
how far back the Hadoop job classpath isolation stuff goes
(mapreduce.job.classloader = true) which I think is required to be set on
Druid tasks for this shaded stuff to work alongside updated Druid
dependencies.

Is anyone opposed to or worried about dropping Hadoop 2.x support after the
Druid 0.22 release?


Re: Enabling dependabot in our github repository

2021-06-08 Thread Gian Merlino
Here's a running list of PRs opened by the dependabot:
https://github.com/apache/druid/pulls?q=is%3Apr+author%3Aapp%2Fdependabot

On Mon, Jun 7, 2021 at 12:22 PM Gian Merlino  wrote:

> There's been some extra discussion on this PR:
> https://github.com/apache/druid/pull/11079
>
> I just +1'ed it, but I wanted to come back here to say that IMO, we should
> avoid getting in the habit of blindly applying these updates without
> testing. There have been lots of situations in the past where a
> harmless-looking dependency upgrade broke something. Sometimes the new
> dependency version had a regression in it, and sometimes even without
> regressions it can introduce compatibility problems.
>
> So, I think it'd be good to apply the updates when we're confident in our
> ability to test them, and add ignores (or tests!) for the rest.
>
> On Thu, Apr 8, 2021 at 12:35 PM Xavier Léauté 
> wrote:
>
>> Thanks Maytas, I asked in that thread. They seemed concerned about write
>> access requested by dependabot,
>> but that should no longer be required as far as I can tell, now that it is
>> natively integrated into GitHub.
>> It should only be a matter of adding the config file to the repo, similar
>> to what we do to automate closing stale issues / PR.
>>
>> On Tue, Apr 6, 2021 at 2:50 PM Maytas Monsereenusorn 
>> wrote:
>>
>> > I remember seeing that someone asked about Dependabot in the asfinfra
>> > Slack channel a few weeks ago. However, asfinfra said they cannot allow it.
>> > Here is the link:
>> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1616539376210800
>> > I think this is the same as Github's dependabot.
>> >
>> > Best Regards,
>> > Maytas
>> >
>> >
>> > On Tue, Apr 6, 2021 at 2:37 PM Xavier Léauté  wrote:
>> >
>> > > Hi folks, as you know Druid has a lot of dependencies, and keeping up
>> > with
>> > > the latest versions of everything, whether it relates to fixing CVEs
>> or
>> > > other improvements is a lot of manual work.
>> > >
>> > > I suggest we enable Github's dependabot in our repository to keep our
>> > > dependencies up to date. The bot is also helpful in providing a short
>> > > commit log summary to understand changes.
>> > > This might yield a flurry of PRs initially, but we can configure it to
>> > > exclude libraries or version ranges that we know are unsafe for us to
>> > > upgrade to.
>> > >
>> > > It looks like some other ASF repos have this enabled already (see
>> > > https://github.com/apache/commons-imaging/pull/126), so hopefully
>> this
>> > > only
>> > > requires filing an INFRA ticket.
>> > >
>> > > Happy to take care of it if folks are on board.
>> > >
>> > > Thanks!
>> > > Xavier
>> > >
>> >
>>
>
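
For reference, the dependabot config file Xavier mentions would be something
like the following .github/dependabot.yml; the ecosystem and schedule shown
here are illustrative assumptions rather than the exact settings that were
committed:

version: 2
updates:
  # Keep Maven dependencies up to date; dependabot opens a PR per update.
  - package-ecosystem: "maven"
    directory: "/"
    schedule:
      interval: "weekly"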


Re: [DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Itai Yaffe
Hey Clint,
I think it's definitely a step in the right direction.
One thing I would suggest, since there are several deployments using Hadoop
(either for deep storage and/or for ingestion), is to let the wider
community know in advance that Hadoop 2.x support is going to be dropped in
favor of 3.x (so they have time to adjust their deployments accordingly).
If that sort of community-wide notification has already been done and I
missed it, please let me know.

Thanks!
  Itai

On Tue, Jun 8, 2021 at 11:08 AM Clint Wylie  wrote:

> Hi all,
>
> I've been assisting with some experiments to see how we might want to
> migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
> can finally be free of some of the dependency issues it has been causing
> for as long as I can remember working with Druid.
>
> Hadoop 3 introduced shaded client jars,
> https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
> allowing applications to talk to the Hadoop cluster without drowning in its
> transitive dependencies. The experimental branch that I have been helping
> with, which is using these new shaded client jars, can be seen in this PR
> https://github.com/apache/druid/pull/11314, and is currently working with
> the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
> docs (which is pretty much equivalent to the HDFS integration test).
>
> The cloud deep storages still need some further testing, and some minor
> cleanup still needs to be done for the docs and such. Additionally we still
> need to figure out how to handle the Kerberos extension, because it extends
> some Hadoop classes so isn't able to use the shaded client jars in a
> straightforward manner, and so still has heavy dependencies and hasn't been
> tested. However, the experiment has started to pan out enough that I think
> it is worth starting this discussion, because it does have some
> implications.
>
> Making this change I think will allow us to update our dependencies with a
> lot more freedom (I'm looking at you, Guava), but the catch is that once we
> make this change and start updating these dependencies, it will become
> hard, nearly impossible, to support Hadoop 2.x, since as far as I know
> there isn't an equivalent set of shaded client jars. I am also not certain
> how far back the Hadoop job classpath isolation stuff goes
> (mapreduce.job.classloader = true) which I think is required to be set on
> Druid tasks for this shaded stuff to work alongside updated Druid
> dependencies.
>
> Is anyone opposed to or worried about dropping Hadoop 2.x support after the
> Druid 0.22 release?
>


[DISCUSS] Hadoop 3, dropping support for Hadoop 2.x

2021-06-08 Thread Clint Wylie
Hi all,

I've been assisting with some experiments to see how we might want to
migrate Druid to support Hadoop 3.x, and more importantly, see if maybe we
can finally be free of some of the dependency issues it has been causing
for as long as I can remember working with Druid.

Hadoop 3 introduced shaded client jars,
https://issues.apache.org/jira/browse/HADOOP-11804, with the purpose of
allowing applications to talk to the Hadoop cluster without drowning in its
transitive dependencies. The experimental branch that I have been helping
with, which is using these new shaded client jars, can be seen in this PR
https://github.com/apache/druid/pull/11314, and is currently working with
the HDFS integration tests as well as the Hadoop tutorial flow in the Druid
docs (which is pretty much equivalent to the HDFS integration test).
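
As a concrete sketch of what this looks like on the dependency side, the
Hadoop 3 shaded client artifacts from HADOOP-11804 would be pulled in
roughly like this in a Maven pom (the version shown is illustrative):

<!-- Compile against the shaded API jar; the runtime jar carries Hadoop's
     transitive dependencies relocated under a Hadoop-private package. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.3.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.3.0</version>
  <scope>runtime</scope>
</dependency>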

The cloud deep storages still need some further testing, and some minor
cleanup still needs to be done for the docs and such. Additionally we still
need to figure out how to handle the Kerberos extension, because it extends
some Hadoop classes so isn't able to use the shaded client jars in a
straightforward manner, and so still has heavy dependencies and hasn't been
tested. However, the experiment has started to pan out enough that I think
it is worth starting this discussion, because it does have some
implications.

Making this change I think will allow us to update our dependencies with a
lot more freedom (I'm looking at you, Guava), but the catch is that once we
make this change and start updating these dependencies, it will become
hard, nearly impossible, to support Hadoop 2.x, since as far as I know
there isn't an equivalent set of shaded client jars. I am also not certain
how far back the Hadoop job classpath isolation stuff goes
(mapreduce.job.classloader = true) which I think is required to be set on
Druid tasks for this shaded stuff to work alongside updated Druid
dependencies.

Is anyone opposed to or worried about dropping Hadoop 2.x support after the
Druid 0.22 release?