I like the plan of having a Hadoop profile, not shipping it as part of the Apache release, and then eventually removing it in a release or two. Does that work for you folks, Maytas, Lucas?
On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <k...@rxd.hu> wrote:
> Hey,
>
> I was also bumping into this while I was running dependency checks for
> Druid 33:
> * I encountered a CVE [1] in hadoop-runtime-3.3.6, which is a shaded jar
> * we have a PR to upgrade to 3.4.0, so I also checked 3.4.1 - but both
> are affected as well, as they ship with Jetty 9.4.53.v20231009 [2]
>
> ...so right now there is no clean way to solve this - and the fact that
> it's a shaded jar further complicates things.
>
> Note: trunk Hadoop uses Jetty 9.4.57 [3] - which is good; so some future
> version might not be affected.
> I wanted to be thorough and dug into a few things to see how soon an
> updated version might come out:
> * there are 300+ tickets targeted for 3.5.0, so that doesn't look
> promising
> * even for 3.4.2 there is a huge Jira [4] with 159 subtasks, of which
> 123 are unassigned... if that's really needed for 3.4.2, then I doubt
> they'll be rolling out a release soon
> * I also peeked into the JDK 17 Jiras, which will most likely land in
> 3.5.0 [5]
>
> Keeping Hadoop like this:
> * holds us back from upgrading 3rd-party deps
> * forces us to add security suppressions
> * slows down adoption of newer JDKs, as Hadoop officially supports only
> Java 11
>
> I think most of the companies using Hadoop are running binaries built
> from forks - and they also have the ability and bandwidth to fix these
> 3rd-party libraries...
> I would also guess that they might be using a custom-built Druid as
> well - and, as a result, they have more control over which features
> they have.
>
> So I was wondering about the following:
> * add a Maven profile for Hadoop support (defaults to off)
> * retain compatibility: during CI runs, build with JDK 11 and run all
> Hadoop tests
> * future releases (>=34) would ship w/o Hadoop ingestion
> * companies using Hadoop ingestion could turn on the profile and use it
>
> What do you guys think?
>
> cheers,
> Zoltan
>
>
> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
> [2] https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
> [3] https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
> [4] https://issues.apache.org/jira/browse/HADOOP-19353
> [5] https://issues.apache.org/jira/browse/HADOOP-17177
>
>
> On 1/8/25 11:56, Abhishek Agarwal wrote:
> > @Adarsh - FYI since you are the release manager for 32.
> >
> > On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <abhis...@apache.org>
> > wrote:
> >
> >> I don't want to kick that can too far down the road either :) We
> >> don't want to give false hope that it's going to remain around
> >> forever. But yes, let's deprecate both Hadoop and Java 11 support in
> >> the upcoming 32 release. It's unfortunate that Hadoop still doesn't
> >> support Java 17. We shouldn't let it hold us back. Jetty and pac4j
> >> are dropping Java 11 support, and we would want to upgrade to newer
> >> versions of these dependencies soon. There are also nice language
> >> features in Java 17, such as pattern matching, multiline strings,
> >> and a lot more, that we can't use if we have to stay
> >> compile-compatible with Java 11. If you need the resource elasticity
> >> that Hadoop provides, or want to reuse shared infrastructure in the
> >> company, MM-less ingestion is a good alternative.
> >>
> >> So let's deprecate it in 32. We can decide on removal later, but
> >> hopefully it doesn't take too many releases to do that.
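[Editor's note: as a rough illustration of the opt-in Maven profile Zoltan proposes above, the root pom.xml could gate the Hadoop modules behind a profile that is off by default. This is only a sketch; the profile id and module name here are hypothetical, not an actual Druid patch.]

```xml
<!-- Hypothetical sketch of an opt-in profile for Hadoop support.       -->
<!-- With no profile active, the Hadoop ingestion module is not built   -->
<!-- or bundled; activating the profile restores the old behavior.      -->
<profiles>
  <profile>
    <id>hadoop3</id> <!-- illustrative profile id -->
    <activation>
      <activeByDefault>false</activeByDefault>
    </activation>
    <modules>
      <module>indexing-hadoop</module> <!-- illustrative module name -->
    </modules>
  </profile>
</profiles>
```

Companies that still need Hadoop ingestion would then build with something like `mvn -Phadoop3 clean package`, while default builds and the Apache convenience binaries would omit it.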
> >>
> >> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <ka...@apache.org> wrote:
> >>
> >>> Okay, from what I can gather, a few folks still need Hadoop
> >>> ingestion. So let's kick the can down the road regarding removal of
> >>> that support, but let's agree on the deprecation plan. Since Druid
> >>> 32 is around the corner, let's at least deprecate Hadoop ingestion
> >>> so that no new users are onboarded to this way of ingestion.
> >>> Deprecation also becomes a forcing function in internal company
> >>> channels for prioritizing getting off Hadoop.
> >>>
> >>> How does this plan look?
> >>>
> >>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn
> >>> <mayt...@apache.org> wrote:
> >>>
> >>>> We at Netflix are in a similar situation to Target Corporation
> >>>> (Lucas C's email above).
> >>>> We currently rely on Hadoop ingestion for all our batch ingestion
> >>>> jobs. The main reason for this is that we already have a large
> >>>> Hadoop cluster supporting our Spark workloads that we can leverage
> >>>> for Druid ingestion. I imagine that the closest alternative for us
> >>>> would be moving to K8s / MiddleManager-less ingestion jobs.
> >>>>
> >>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant
> >>>> <capistrant.lu...@gmail.com> wrote:
> >>>>
> >>>>> Apologies for the empty email… fat fingers.
> >>>>>
> >>>>> Just wanted to say that we at Target Corporation (USA) still rely
> >>>>> heavily on Hadoop ingest. We'd selfishly want support forever, but
> >>>>> if forced to pivot to a new ingestion style for our larger batch
> >>>>> ingest jobs that currently leverage the cheap compute on YARN, the
> >>>>> longer the lead time between the community's announcement and the
> >>>>> actual release with no support, the better. Making these types of
> >>>>> changes can be a slow process for a slow-to-maneuver corporate
> >>>>> cruise ship.
> >>>>>
> >>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant
> >>>>> <capistrant.lu...@gmail.com> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> +1 for removal of Hadoop-based ingestion. It's a maintenance
> >>>>>>> overhead and stops us from moving to Java 17.
> >>>>>>> I am not aware of any gaps in SQL-based ingestion that prevent
> >>>>>>> users from moving off Hadoop. If there are any, please feel free
> >>>>>>> to reach out via Slack/GitHub.
> >>>>>>>
> >>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey everyone,
> >>>>>>>>
> >>>>>>>> It is about that time again to take a pulse on how commonly
> >>>>>>>> Hadoop-based ingestion is used with Druid, in order to
> >>>>>>>> determine whether we should keep supporting it going forward.
> >>>>>>>>
> >>>>>>>> In my view, Hadoop-based ingestion has unofficially been on
> >>>>>>>> life support for quite some time, as we do not really go out of
> >>>>>>>> our way to add new features to it, and we perform very minimal
> >>>>>>>> testing to ensure everything keeps working. The most recent
> >>>>>>>> changes to it that I am aware of were to bump versions and
> >>>>>>>> require Hadoop 3, but that was primarily motivated by the
> >>>>>>>> selfish reason of wanting to use its contained client library
> >>>>>>>> and better isolation, so that we could free up our own
> >>>>>>>> dependencies to be updated. This thread is motivated by a
> >>>>>>>> similar reason; see the other thread I started recently
> >>>>>>>> discussing dropping support for Java 11, where Hadoop does not
> >>>>>>>> yet support the Java 17 runtime, so the outcome of this
> >>>>>>>> discussion is involved in those plans.
> >>>>>>>>
> >>>>>>>> I think SQL-based ingestion with the multi-stage query engine
> >>>>>>>> is the future of batch ingestion, and the Kubernetes-based task
> >>>>>>>> runner provides an alternative for task auto-scaling
> >>>>>>>> capabilities. Because of this, I don't personally see a lot of
> >>>>>>>> compelling reasons to keep supporting Hadoop, so I would be in
> >>>>>>>> favor of just dropping support for it completely, though I see
> >>>>>>>> no harm in keeping HDFS deep storage around. In past
> >>>>>>>> discussions I think we had tied Hadoop removal to adding
> >>>>>>>> something like Spark to replace it, but I wonder if that still
> >>>>>>>> needs to be the case.
> >>>>>>>>
> >>>>>>>> I do know from previous dev list discussions on this topic that
> >>>>>>>> there have classically been quite a lot of large Druid clusters
> >>>>>>>> in the wild still relying on Hadoop, so I wanted to check
> >>>>>>>> whether this is still true, and if so, whether any of these
> >>>>>>>> clusters have plans to transition to newer ways of ingesting
> >>>>>>>> data, like SQL-based ingestion. While from a dev/maintenance
> >>>>>>>> perspective it would be best to just drop it completely, if
> >>>>>>>> there is still a large user base I think we need to be open to
> >>>>>>>> keeping it around a while longer. If we do need to keep it,
> >>>>>>>> maybe it would be worth investing some time in moving it into a
> >>>>>>>> contrib extension, so that it isn't bundled by default with
> >>>>>>>> Druid releases, to discourage new adoption and more accurately
> >>>>>>>> reflect its current status in Druid.
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> >>>>>>>> For additional commands, e-mail: dev-h...@druid.apache.org
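[Editor's note: if Hadoop ingestion did move to a contrib extension as Clint suggests, operators who still need it would pull it in explicitly the way other contrib extensions are loaded today, via `druid.extensions.loadList`. The sketch below assumes a hypothetical extension name; `druid-hdfs-storage` is the existing core HDFS deep storage extension, which the thread proposes to keep.]

```properties
# common.runtime.properties - illustrative sketch only.
# "druid-hadoop-ingestion-contrib" is a hypothetical name for the
# relocated contrib extension; HDFS deep storage stays a core extension.
druid.extensions.loadList=["druid-hdfs-storage", "druid-hadoop-ingestion-contrib"]
```

Contrib extensions are not bundled in the release tarball, so users would also fetch the jar separately (e.g. with Druid's pull-deps tool) before adding it to the load list, which matches the goal of discouraging new adoption by default.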