On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org> wrote:
> +1 for removal of Hadoop based ingestion. It's a maintenance overhead and
> stops us from moving to Java 17.
> I am not aware of any gaps in SQL based ingestion that prevent users from
> moving off Hadoop. If there are any, please feel free to reach out via
> Slack/GitHub.
>
> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org> wrote:
>
> > Hey everyone,
> >
> > It is about that time again to take a pulse on how commonly Hadoop
> > based ingestion is used with Druid in order to determine if we should
> > keep supporting it or not going forward.
> >
> > In my view, Hadoop based ingestion has unofficially been on life
> > support for quite some time as we do not really go out of our way to
> > add new features to it, and we perform very minimal testing to ensure
> > everything keeps working. The most recent change to it I am aware of
> > was to bump versions and require Hadoop 3, but that was primarily
> > motivated by selfish reasons of wanting to use its contained client
> > library and better isolation so that we could free up our own
> > dependencies to be updated. This thread is motivated by a similar
> > reason I guess, see the other thread I started recently discussing
> > dropping support for Java 11 where Hadoop does not yet support Java 17
> > runtime, and so the outcome of this discussion is involved in those
> > plans.
> >
> > I think SQL based ingestion with the multi-stage query engine is the
> > future of batch ingestion, and the Kubernetes based task runner
> > provides an alternative for task auto scaling capabilities. Because of
> > this, I don't personally see a lot of compelling reasons to keep
> > supporting Hadoop, so I would be in favor of just dropping support for
> > it completely, though I see no harm in keeping HDFS deep storage
> > around. In past discussions I think we had tied Hadoop removal to
> > adding something like Spark to replace it, but I wonder if this still
> > needs to be the case.
> >
> > I do know that classically there have been quite a lot of large Druid
> > clusters in the wild still relying on Hadoop in previous dev list
> > discussions about this topic, so I wanted to check to see if this is
> > still true and if so if any of these clusters have plans to transition
> > to newer ways of ingesting data like SQL based ingestion. While from a
> > dev/maintenance perspective it would be best to just drop it
> > completely, if there is still a large user base I think we need to be
> > open to keeping it around for a while longer. If we do need to keep
> > it, maybe it would be worth it to invest some time in moving it into a
> > contrib extension so that it isn't bundled by default with Druid
> > releases to discourage new adoption and more accurately reflect its
> > current status in Druid.