Re: [DISCUSS] Hadoop ingestion support

Abhishek Agarwal Wed, 08 Jan 2025 02:57:08 -0800

@Adarsh - FYI since you are the release manager for 32.

On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <[email protected]> wrote:


> I don't want to kick that can too far down the road either :) We don't
> want to give false hope that it's going to remain around forever. But yes,
> let's deprecate both Hadoop and Java 11 support in the upcoming 32 release.
> It's unfortunate that Hadoop still doesn't support Java 17. We shouldn't
> let it hold us back. Jetty and pac4j are dropping Java 11 support and we would
> want to upgrade to newer versions of these dependencies soon. There are
> also nice language features in Java 17 such as pattern matching, multiline
> strings, and a lot more that we can't use if we have to stay
> compile-compatible with Java 11. If you need the resource elasticity that Hadoop
> provides or want to reuse shared infrastructure in the company, MM-less
> ingestion is a good alternative.
>
> So let's deprecate it in 32. We can decide on removal later but hopefully,
> it doesn't take too many releases to do that.
>
> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <[email protected]> wrote:
>
>> Okay, from what I can gather, a few folks still need Hadoop ingestion. So
>> let's kick the can down the road regarding removal of that support, but
>> let's agree on the deprecation plan. Since Druid 32 is around the corner,
>> let's at least deprecate Hadoop ingestion so that new users are not
>> onboarded to this way of ingestion. Deprecation also becomes a forcing
>> function in internal company channels for prioritizing getting off Hadoop.
>>
>> How does this plan look?
>>
>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn <[email protected]> wrote:
>>
>> > We at Netflix are in a similar situation to Target Corporation (see Lucas
>> > C's email above).
>> > We currently rely on Hadoop ingestion for all our batch ingestion jobs. The
>> > main reason for this is that we already have a large Hadoop cluster
>> > supporting our Spark workloads that we can leverage for Druid ingestion. I
>> > imagine that the closest alternative for us would be moving to K8s /
>> > MiddleManager-less ingestion jobs.
>> >
>> > On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant <[email protected]> wrote:
>> >
>> > > Apologies for the empty email… fat fingers.
>> > >
>> > > Just wanted to say that we at Target Corporation (USA) still rely heavily
>> > > on Hadoop ingest. We’d selfishly want support forever, but if forced to
>> > > pivot to a new ingestion style for our larger batch ingest jobs that
>> > > currently leverage the cheap compute on YARN, the longer the lead time
>> > > between the community's announcement and the actual release with no
>> > > support, the better. Making these types of changes can be a slow process
>> > > for the slow-to-maneuver corporate cruise ship.
>> > >
>> > > On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <[email protected]> wrote:
>> > >
>> > > >
>> > > >
>> > > > On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <[email protected]> wrote:
>> > > >
>> > > >> +1 for removal of Hadoop based ingestion. It's a maintenance overhead
>> > > >> and stops us from moving to Java 17.
>> > > >> I am not aware of any gaps in SQL based ingestion that prevent users
>> > > >> from moving off Hadoop. If there are any, please feel free to reach out
>> > > >> via Slack/GitHub.
>> > > >>
>> > > >> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <[email protected]> wrote:
>> > > >>
>> > > >> > Hey everyone,
>> > > >> >
>> > > >> > It is about that time again to take a pulse on how commonly Hadoop
>> > > >> > based ingestion is used with Druid in order to determine if we should
>> > > >> > keep supporting it or not going forward.
>> > > >> >
>> > > >> > In my view, Hadoop based ingestion has unofficially been on life
>> > > >> > support for quite some time as we do not really go out of our way to
>> > > >> > add new features to it, and we perform very minimal testing to ensure
>> > > >> > everything keeps working. The most recent change to it I am aware of
>> > > >> > was to bump versions and require Hadoop 3, but that was primarily
>> > > >> > motivated by selfish reasons of wanting to use its contained client
>> > > >> > library and better isolation so that we could free up our own
>> > > >> > dependencies to be updated. This thread is motivated by a similar
>> > > >> > reason I guess; see the other thread I started recently discussing
>> > > >> > dropping support for Java 11. Hadoop does not yet support the Java 17
>> > > >> > runtime, so the outcome of this discussion is involved in those
>> > > >> > plans.
>> > > >> >
>> > > >> > I think SQL based ingestion with the multi-stage query engine is the
>> > > >> > future of batch ingestion, and the Kubernetes based task runner
>> > > >> > provides an alternative for task auto scaling capabilities. Because of
>> > > >> > this, I don't personally see a lot of compelling reasons to keep
>> > > >> > supporting Hadoop, so I would be in favor of just dropping support for
>> > > >> > it completely, though I see no harm in keeping HDFS deep storage
>> > > >> > around. In past discussions I think we had tied Hadoop removal to
>> > > >> > adding something like Spark to replace it, but I wonder if this still
>> > > >> > needs to be the case.
>> > > >> >
>> > > >> > I do know that, going by previous dev list discussions about this
>> > > >> > topic, there have classically been quite a lot of large Druid
>> > > >> > clusters in the wild still relying on Hadoop, so I wanted to check
>> > > >> > whether this is still true and, if so, whether any of these clusters
>> > > >> > have plans to transition to newer ways of ingesting data like SQL
>> > > >> > based ingestion. While from a dev/maintenance perspective it would be
>> > > >> > best to just drop it completely, if there is still a large user base I
>> > > >> > think we need to be open to keeping it around for a while longer. If
>> > > >> > we do need to keep it, maybe it would be worth it to invest some time
>> > > >> > in moving it into a contrib extension so that it isn't bundled by
>> > > >> > default with Druid releases, to discourage new adoption and more
>> > > >> > accurately reflect its current status in Druid.
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>>
>
