+1 for removal of Hadoop-based ingestion. It is a maintenance overhead and blocks us from moving to Java 17. I am not aware of any gaps in SQL-based ingestion that prevent users from moving off Hadoop. If there are any, please feel free to reach out via Slack/GitHub.
On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org> wrote:

> Hey everyone,
>
> It is about that time again to take a pulse on how commonly Hadoop
> based ingestion is used with Druid in order to determine if we should
> keep supporting it or not going forward.
>
> In my view, Hadoop based ingestion has unofficially been on life
> support for quite some time as we do not really go out of our way to
> add new features to it, and we perform very minimal testing to ensure
> everything keeps working. The most recent changes to it I am aware of
> was to bump versions and require Hadoop 3, but that was primarily
> motivated by selfish reasons of wanting to use its contained client
> library and better isolation so that we could free up our own
> dependencies to be updated. This thread is motivated by a similar
> reason I guess, see the other thread I started recently discussing
> dropping support for Java 11 where Hadoop does not yet support Java 17
> runtime, and so the outcome of this discussion is involved in those
> plans.
>
> I think SQL based ingestion with the multi-stage query engine is the
> future of batch ingestion, and the Kubernetes based task runner
> provides an alternative for task auto scaling capabilities. Because of
> this, I don't personally see a lot of compelling reasons to keep
> supporting Hadoop, so I would be in favor of just dropping support for
> it completely, though I see no harm in keeping HDFS deep storage
> around. In past discussions I think we had tied Hadoop removal to
> adding something like Spark to replace it, but I wonder if this still
> needs to be the case.
>
> I do know that classically there have been quite a lot of large Druid
> clusters in the wild still relying on Hadoop in previous dev list
> discussions about this topic, so I wanted to check to see if this is
> still true and if so if any of these clusters have plans to transition
> to newer ways of ingesting data like SQL based ingestion. While from a
> dev/maintenance perspective it would be best to just drop it
> completely, if there is still a large user base I think we need to be
> open to keeping it around for a while longer. If we do need to keep
> it, maybe it would be worth it to invest some time in moving it into a
> contrib extension so that it isn't bundled by default with Druid
> releases to discourage new adoption and more accurately reflect its
> current status in Druid.