I'm in favor of removing it too, but we should not rush the removal and
should make sure users have enough time to migrate to other types of
ingestion. Similar to what Lucas said, if Hadoop is holding Druid back then
we should remove it. Druid also supports many more types of ingestion now
than it did back when Hadoop ingestion was added.
For Netflix, we will be migrating to MM-less Druid ingestion in K8s. I
think it is probably the closest alternative to Hadoop ingestion, since we
do not have to maintain a dedicated Druid-specific MM cluster (it works
well for companies with existing large/shared compute clusters).
Personally, I feel we should focus our energy on things like MM-less Druid
in K8s (which is still marked as Experimental) rather than on Hadoop.
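
For context, "MM-less" boils down to roughly the following on the Overlord
(a minimal sketch - the property names come from the
druid-kubernetes-overlord-extensions docs and are worth double checking
against your Druid version):

  # load the Kubernetes task runner extension (no MiddleManagers needed)
  druid.extensions.loadList=[..., "druid-kubernetes-overlord-extensions"]
  # run each ingestion task as its own short-lived K8s pod
  druid.indexer.runner.type=k8s
  # namespace in which the task pods are created
  druid.indexer.runner.namespace=druid

Tasks then run as pods on the shared K8s cluster, which is what makes it a
reasonable replacement for borrowing spare capacity from YARN.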

Best Regards,
Maytas

On Tue, Apr 8, 2025 at 4:06 AM Lucas Capistrant <capistrant.lu...@gmail.com>
wrote:

> Yes, I’m in favor of removing it from the core release and also in favor of
> officially announcing deprecation with a timeline for removal, if we have
> not already. It stinks to lose the Hadoop ingest support, but if that project
> is going to hold back Druid, it seems we don’t have much choice.
>
> Thanks,
> Lucas
>
> On Tue, Apr 8, 2025 at 4:27 AM Karan Kumar <ka...@apache.org> wrote:
>
> >
> > I like the plan of having a hadoop profile, not shipping it as part of the
> > apache release, and then eventually removing it in a release or two.
> > Does that work for you folks, Maytas and Lucas?
> >
> > On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <k...@rxd.hu> wrote:
> >
> >> Hey,
> >>
> >> I was also bumping into this while I was running dependency-checks for
> >> Druid-33
> >> * I've encountered a CVE [1] in hadoop-runtime-3.3.6, which is a shaded
> >> jar
> >> * we have a PR to upgrade to 3.4.0, so I also checked 3.4.1 - but both
> >> are affected as well, since they ship with jetty 9.4.53.v20231009 [2]
> >>
> >> ..so right now there is no clean way to solve this - the fact that it's a
> >> shaded jar further complicates things..
> >>
> >> Note: trunk Hadoop uses jetty 9.4.57 [3] - which is good; so there
> >> will be some future version which might not be affected
> >> I wanted to be thorough and dug into a few things - to see how soon an
> >> updated version may come out:
> >> * there are 300+ tickets targeted for 3.5.0 .. so that doesn't look
> >> promising
> >> * but even for 3.4.2 there is a huge jira [4] with 159 subtasks, out of
> >> which 123 are unassigned...
> >>    if that's really needed for 3.4.2 then I doubt they'll be rolling out
> >> a release soon...
> >> * I was also peeking into jdk17 jiras which will most likely arrive in
> >> 3.5.0 [5]
> >>
> >> Keeping Hadoop like this:
> >> * holds us back from upgrading 3rd party deps
> >> * forces us to add security suppressions
> >> * slows down newer jdk adoption - as officially hadoop only supports 11
> >>
> >> I think most of the companies using Hadoop are running binaries built
> >> from forks - and they also have the ability & bandwidth to fix these 3rd
> >> party libraries...
> >> I would also guess that they might be using a custom-built Druid as
> >> well - and as a result they have more control over which features they
> >> do or don't have.
> >>
> >> So I was wondering about the following:
> >> * add a maven profile for hadoop support (defaults to off) - see the
> >> sketch below
> >> * retain compatibility: during CI runs, build with jdk11 and run all
> >> hadoop tests
> >> * future releases (>=34) would ship w/o hadoop ingestion
> >> * companies using hadoop-ingestion could turn on the profile and use it
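> >>
> >> Roughly, something like this in the parent pom (just a sketch - the
> >> profile id and module list here are assumptions, not a final shape):
> >>
> >>   <profile>
> >>     <id>hadoop-ingestion</id>
> >>     <activation>
> >>       <activeByDefault>false</activeByDefault>
> >>     </activation>
> >>     <modules>
> >>       <module>indexing-hadoop</module>
> >>     </modules>
> >>   </profile>
> >>
> >> The apache convenience release just wouldn't activate it, while companies
> >> building their own distribution could run e.g.
> >> mvn clean install -Phadoop-ingestion to keep the hadoop modules in.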
> >>
> >> What do you guys think?
> >>
> >> cheers,
> >> Zoltan
> >>
> >>
> >> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
> >> [2]
> >>
> https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
> >> [3]
> >>
> https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
> >> [4] https://issues.apache.org/jira/browse/HADOOP-19353
> >> [5] https://issues.apache.org/jira/browse/HADOOP-17177
> >>
> >>
> >> On 1/8/25 11:56, Abhishek Agarwal wrote:
> >> > @Adarsh - FYI since you are the release manager for 32.
> >> >
> >> > On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <abhis...@apache.org
> >
> >> > wrote:
> >> >
> >> >> I don't want to kick that can too far down the road either :) We
> don't
> >> >> want to give a false hope that it's going to remain around forever.
> >> But yes
> >> >> let's deprecate both Hadoop and Java 11 support in the upcoming 32
> >> release.
> >> >> It's unfortunate that Hadoop still doesn't support Java 17. We
> >> shouldn't
> >> >> let it hold us back. Jetty and pac4j are dropping Java 11 support and we
> >> would
> >> >> want to upgrade to newer versions of these dependencies soon. There
> are
> >> >> also nice language features in Java 17 such as pattern matching,
> >> multiline
> >> >> strings, and a lot more that we can't use if we have to be compile
> >> >> compatible with Java 11. If you need the resource elasticity that
> >> Hadoop
> >> >> provides or want to reuse shared infrastructure in the company,
> MM-less
> >> >> ingestion is a good alternative.
> >> >>
> >> >> So let's deprecate it in 32. We can decide on removal later, but
> >> >> hopefully it doesn't take too many releases to do that.
> >> >>
> >> >> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <ka...@apache.org> wrote:
> >> >>
> >> >>> Okay, from what I can gather a few folks still need hadoop ingestion.
> >> >>> So let's kick the can down the road regarding removal of that support,
> >> >>> but let's agree on the deprecation plan. Since druid 32 is around the
> >> >>> corner, let's at least deprecate hadoop ingestion so that no new users
> >> >>> are onboarded to this way of ingestion. Deprecation also becomes a
> >> >>> forcing function in internal company channels for prioritizing getting
> >> >>> off hadoop.
> >> >>>
> >> >>> How does this plan look?
> >> >>>
> >> >>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn <
> >> mayt...@apache.org
> >> >>>>
> >> >>> wrote:
> >> >>>
> >> >>>> We at Netflix are in a similar situation to Target Corporation
> >> >>>> (Lucas C's email above).
> >> >>>> We currently rely on Hadoop ingestion for all our batch ingestion
> >> >>>> jobs. The main reason for this is that we already have a large
> >> >>>> Hadoop cluster supporting our Spark workloads that we can leverage
> >> >>>> for Druid ingestion. I imagine that the closest alternative for us
> >> >>>> would be moving to K8s / MiddleManager-less ingestion jobs.
> >> >>>>
> >> >>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <
> >> >>>> capistrant.lu...@gmail.com> wrote:
> >> >>>>
> >> >>>>> Apologies for the empty email… fat fingers.
> >> >>>>>
> >> >>>>> Just wanted to say that we at Target Corporation (USA) still rely
> >> >>>>> heavily on Hadoop ingest. We'd selfishly want support forever, but
> >> >>>>> if forced to pivot to a new ingestion style for our larger batch
> >> >>>>> ingest jobs that currently leverage the cheap compute on YARN, the
> >> >>>>> longer the lead time between the community's announcement and the
> >> >>>>> actual release with no support, the better. Making these types of
> >> >>>>> changes can be a slow process for a slow-to-maneuver corporate
> >> >>>>> cruise ship.
> >> >>>>>
> >> >>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <
> >> >>>>> capistrant.lu...@gmail.com>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org>
> >> >>> wrote:
> >> >>>>>>
> >> >>>>>>> +1 for removal of Hadoop based ingestion. It's a maintenance
> >> >>>>>>> overhead and stops us from moving to java 17.
> >> >>>>>>> I am not aware of any gaps in sql based ingestion that prevent
> >> >>>>>>> users from moving off hadoop. If there are any, please feel free
> >> >>>>>>> to reach out via slack/github.
> >> >>>>>>>
> >> >>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org>
> >> >>>> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hey everyone,
> >> >>>>>>>>
> >> >>>>>>>> It is about that time again to take a pulse on how commonly
> >> >>> Hadoop
> >> >>>>>>>> based ingestion is used with Druid in order to determine if we
> >> >>>> should
> >> >>>>>>>> keep supporting it or not going forward.
> >> >>>>>>>>
> >> >>>>>>>> In my view, Hadoop based ingestion has unofficially been on
> life
> >> >>>>>>>> support for quite some time as we do not really go out of our
> >> >>> way to
> >> >>>>>>>> add new features to it, and we perform very minimal testing to
> >> >>>> ensure
> >> >>>>>>>> everything keeps working. The most recent changes to it I am
> >> >>>>>>>> aware of were to bump versions and require Hadoop 3, but that
> >> >>>>>>>> was primarily motivated by the selfish reason of wanting to use
> >> >>>>>>>> its bundled client library and better isolation, so that we
> >> >>>>>>>> could free up our own dependencies to be updated. This thread is
> >> >>>>>>>> motivated by a similar reason, I guess - see the other thread I
> >> >>>>>>>> started recently discussing dropping support for Java 11, since
> >> >>>>>>>> Hadoop does not yet support the Java 17 runtime, so the outcome
> >> >>>>>>>> of this discussion is tied into those plans.
> >> >>>>>>>>
> >> >>>>>>>> I think SQL based ingestion with the multi-stage query engine
> is
> >> >>> the
> >> >>>>>>>> future of batch ingestion, and the Kubernetes based task runner
> >> >>>>>>>> provides an alternative for task auto scaling capabilities.
> >> >>> Because
> >> >>>> of
> >> >>>>>>>> this, I don't personally see a lot of compelling reasons to
> keep
> >> >>>>>>>> supporting Hadoop, so I would be in favor of just dropping
> >> >>> support
> >> >>>> for
> >> >>>>>>>> it completely, though I see no harm in keeping HDFS deep
> storage
> >> >>>>>>>> around. In past discussions I think we had tied Hadoop removal
> to
> >> >>>>>>>> adding something like Spark to replace it, but I wonder if this
> >> >>>> still
> >> >>>>>>>> needs to be the case.
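> >> >>>>>>>>
> >> >>>>>>>> For anyone who hasn't tried it, an MSQ batch ingestion is just a
> >> >>>>>>>> SQL statement along these lines (datasource, input URI, and
> >> >>>>>>>> columns here are only illustrative):
> >> >>>>>>>>
> >> >>>>>>>>   REPLACE INTO "wikipedia" OVERWRITE ALL
> >> >>>>>>>>   SELECT
> >> >>>>>>>>     TIME_PARSE("timestamp") AS __time,
> >> >>>>>>>>     page,
> >> >>>>>>>>     added
> >> >>>>>>>>   FROM TABLE(
> >> >>>>>>>>     EXTERN(
> >> >>>>>>>>       '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
> >> >>>>>>>>       '{"type": "json"}',
> >> >>>>>>>>       '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "added", "type": "long"}]'
> >> >>>>>>>>     )
> >> >>>>>>>>   )
> >> >>>>>>>>   PARTITIONED BY DAY
> >> >>>>>>>>
> >> >>>>>>>> and the controller/worker tasks it launches run on the normal
> >> >>>>>>>> Druid task infrastructure, so the Kubernetes task runner covers
> >> >>>>>>>> them as well.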
> >> >>>>>>>>
> >> >>>>>>>> I do know that classically there have been quite a lot of large
> >> >>>> Druid
> >> >>>>>>>> clusters in the wild still relying on Hadoop in previous dev
> list
> >> >>>>>>>> discussions about this topic, so I wanted to check to see if
> >> >>> this is
> >> >>>>>>>> still true and if so if any of these clusters have plans to
> >> >>>> transition
> >> >>>>>>>> to newer ways of ingesting data like SQL based ingestion. While
> >> >>>> from a
> >> >>>>>>>> dev/maintenance perspective it would be best to just drop it
> >> >>>>>>>> completely, if there is still a large user base I think we need
> >> >>> to
> >> >>>> be
> >> >>>>>>>> open to keeping it around for a while longer. If we do need to
> >> >>> keep
> >> >>>>>>>> it, maybe it would be worth it to invest some time in moving it
> >> >>>> into a
> >> >>>>>>>> contrib extension so that it isn't bundled by default with
> Druid
> >> >>>>>>>> releases to discourage new adoption and more accurately reflect
> >> >>> its
> >> >>>>>>>> current status in Druid.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>
> ---------------------------------------------------------------------
> >> >>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> >> >>>>>>>> For additional commands, e-mail: dev-h...@druid.apache.org
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >> >
> >>
> >>
>
