Re: [DISCUSS] Hadoop ingestion support

Lucas Capistrant Thu, 12 Dec 2024 07:55:50 -0800

Apologies for the empty email… fat fingers.

Just wanted to say that we at Target Corporation (USA) still rely heavily
on Hadoop ingest. We’d selfishly want support forever, but if we are forced
to pivot to a new ingestion style for our larger batch ingest jobs that
currently leverage the cheap compute on YARN, then the longer the lead time
between the community’s announcement and the first release without
support, the better. Making these types of changes can be a slow process
for the slow-to-maneuver corporate cruise ship.


On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant <[email protected]>
wrote:

>
>
> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <[email protected]> wrote:
>
>> +1 for removal of Hadoop-based ingestion. It's a maintenance overhead and
>> stops us from moving to Java 17.
>> I am not aware of any gaps in SQL-based ingestion that prevent users from
>> moving off Hadoop. If there are any, please feel free to reach out via
>> Slack/GitHub.
>>
>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <[email protected]> wrote:
>>
>> > Hey everyone,
>> >
>> > It is about that time again to take a pulse on how commonly Hadoop
>> > based ingestion is used with Druid in order to determine if we should
>> > keep supporting it or not going forward.
>> >
>> > In my view, Hadoop based ingestion has unofficially been on life
>> > support for quite some time as we do not really go out of our way to
>> > add new features to it, and we perform very minimal testing to ensure
>> > everything keeps working. The most recent changes to it I am aware of
>> > were to bump versions and require Hadoop 3, but that was primarily
>> > motivated by selfish reasons of wanting to use its contained client
>> > library and better isolation so that we could free up our own
>> > dependencies to be updated. This thread is motivated by a similar
>> > reason, I guess: see the other thread I started recently about
>> > dropping support for Java 11. Hadoop does not yet support the Java 17
>> > runtime, so the outcome of this discussion feeds into those plans.
>> >
>> > I think SQL based ingestion with the multi-stage query engine is the
>> > future of batch ingestion, and the Kubernetes based task runner
>> > provides an alternative for task auto scaling capabilities. Because of
>> > this, I don't personally see a lot of compelling reasons to keep
>> > supporting Hadoop, so I would be in favor of just dropping support for
>> > it completely, though I see no harm in keeping HDFS deep storage
>> > around. In past discussions I think we had tied Hadoop removal to
>> > adding something like Spark to replace it, but I wonder if this still
>> > needs to be the case.
>> >
>> > I do know from previous dev list discussions about this topic that
>> > there have historically been quite a lot of large Druid clusters in
>> > the wild still relying on Hadoop, so I wanted to check whether this is
>> > still true and, if so, whether any of these clusters plan to transition
>> > to newer ways of ingesting data like SQL-based ingestion. While from a
>> > dev/maintenance perspective it would be best to just drop it
>> > completely, if there is still a large user base I think we need to be
>> > open to keeping it around for a while longer. If we do need to keep
>> > it, maybe it would be worth it to invest some time in moving it into a
>> > contrib extension so that it isn't bundled by default with Druid
>> > releases to discourage new adoption and more accurately reflect its
>> > current status in Druid.
>> >
>>
>
