I like the plan of having a Hadoop profile, not shipping it as part of the Apache release, and then eventually removing it in a release or two. Does that work for you folks, Maytas, Lucas?
On Mon, Apr 7, 2025 at 3:59 PM Zoltan Haindrich <k...@rxd.hu> wrote:
> Hey,
>
> I was also bumping into this while I was running dependency checks for
> Druid 33:
> * I encountered a CVE [1] in hadoop-runtime-3.3.6, which is a shaded jar
> * we have a PR to upgrade to 3.4.0, so I also checked 3.4.1 - but both
> are affected as well, as they ship with Jetty 9.4.53.v20231009 [2]
>
> ...so right now there is no clean way to solve this - and the fact that
> it's a shaded jar further complicates things.
>
> Note: trunk Hadoop uses Jetty 9.4.57 [3] - which is good; so some future
> version might not be affected.
> I wanted to be thorough and dug into a few things to see how soon an
> updated version might come out:
> * there are 300+ tickets targeted for 3.5.0, so that doesn't look
> promising
> * even for 3.4.2 there is a huge Jira [4] with 159 subtasks, of which
> 123 are unassigned... if that's really needed for 3.4.2, then I doubt
> they'll be rolling out a release soon
> * I also peeked into the JDK 17 Jiras, which will most likely land in
> 3.5.0 [5]
>
> Keeping Hadoop like this:
> * holds us back from upgrading 3rd-party deps
> * forces us to add security suppressions
> * slows down adoption of newer JDKs, as Hadoop officially supports only
> Java 11
>
> I think most of the companies using Hadoop are running binaries built
> from forks - and they also have the ability and bandwidth to fix these
> 3rd-party libraries...
> I would also guess that they might be using a custom-built Druid as
> well - and, as a result, they have more control over which features
> they have.
>
> So I was wondering about the following:
> * add a Maven profile for Hadoop support (defaults to off)
> * retain compatibility: during CI runs, build with JDK 11 and run all
> Hadoop tests
> * future releases (>=34) would ship w/o Hadoop ingestion
> * companies using Hadoop ingestion could turn on the profile and use it
>
> What do you guys think?
>
> cheers,
> Zoltan
>
>
> [1] https://nvd.nist.gov/vuln/detail/cve-2024-22201
> [2] https://github.com/apache/hadoop/blob/626b227094027ed08883af97a0734d2db7863864/hadoop-project/pom.xml#L40
> [3] https://github.com/apache/hadoop/blob/3d2f4d669edcf321509ceacde58a8160aef06a8c/hadoop-project/pom.xml#L40
> [4] https://issues.apache.org/jira/browse/HADOOP-19353
> [5] https://issues.apache.org/jira/browse/HADOOP-17177
>
>
> On 1/8/25 11:56, Abhishek Agarwal wrote:
> > @Adarsh - FYI since you are the release manager for 32.
> >
> > On Wed, Jan 8, 2025 at 11:53 AM Abhishek Agarwal <abhis...@apache.org>
> > wrote:
> >
> >> I don't want to kick that can too far down the road either :) We
> >> don't want to give false hope that it's going to remain around
> >> forever. But yes, let's deprecate both Hadoop and Java 11 support in
> >> the upcoming 32 release. It's unfortunate that Hadoop still doesn't
> >> support Java 17. We shouldn't let it hold us back. Jetty and pac4j
> >> are dropping Java 11 support, and we would want to upgrade to newer
> >> versions of these dependencies soon. There are also nice language
> >> features in Java 17, such as pattern matching, multiline strings,
> >> and a lot more, that we can't use if we have to stay
> >> compile-compatible with Java 11. If you need the resource elasticity
> >> that Hadoop provides, or want to reuse shared infrastructure in the
> >> company, MM-less ingestion is a good alternative.
> >>
> >> So let's deprecate it in 32. We can decide on removal later, but
> >> hopefully it doesn't take too many releases to do that.
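[Editor's note: as a rough illustration of the opt-in Maven profile Zoltan proposes above, the root pom.xml could gate the Hadoop modules behind a profile that is off by default. This is only a sketch; the profile id and module name here are hypothetical, not an actual Druid patch.]

```xml
<!-- Hypothetical sketch of an opt-in profile for Hadoop support.       -->
<!-- With no profile active, the Hadoop ingestion module is not built   -->
<!-- or bundled; activating the profile restores the old behavior.      -->
<profiles>
  <profile>
    <id>hadoop3</id> <!-- illustrative profile id -->
    <activation>
      <activeByDefault>false</activeByDefault>
    </activation>
    <modules>
      <module>indexing-hadoop</module> <!-- illustrative module name -->
    </modules>
  </profile>
</profiles>
```

Companies that still need Hadoop ingestion would then build with something like `mvn -Phadoop3 clean package`, while default builds and the Apache convenience binaries would omit it.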
> >>
> >> On Tue, Jan 7, 2025 at 4:22 PM Karan Kumar <ka...@apache.org> wrote:
> >>
> >>> Okay, from what I can gather, a few folks still need Hadoop
> >>> ingestion. So let's kick the can down the road regarding removal of
> >>> that support, but let's agree on the deprecation plan. Since Druid
> >>> 32 is around the corner, let's at least deprecate Hadoop ingestion
> >>> so that no new users are onboarded to this way of ingestion.
> >>> Deprecation also becomes a forcing function in internal company
> >>> channels for prioritizing getting off Hadoop.
> >>>
> >>> How does this plan look?
> >>>
> >>> On Fri, Dec 13, 2024 at 1:11 AM Maytas Monsereenusorn
> >>> <mayt...@apache.org> wrote:
> >>>
> >>>> We at Netflix are in a similar situation to Target Corporation
> >>>> (Lucas C's email above).
> >>>> We currently rely on Hadoop ingestion for all our batch ingestion
> >>>> jobs. The main reason for this is that we already have a large
> >>>> Hadoop cluster supporting our Spark workloads that we can leverage
> >>>> for Druid ingestion. I imagine that the closest alternative for us
> >>>> would be moving to K8s / MiddleManager-less ingestion jobs.
> >>>>
> >>>> On Thu, Dec 12, 2024 at 10:56 PM Lucas Capistrant
> >>>> <capistrant.lu...@gmail.com> wrote:
> >>>>
> >>>>> Apologies for the empty email… fat fingers.
> >>>>>
> >>>>> Just wanted to say that we at Target Corporation (USA) still rely
> >>>>> heavily on Hadoop ingest. We'd selfishly want support forever, but
> >>>>> if forced to pivot to a new ingestion style for our larger batch
> >>>>> ingest jobs that currently leverage the cheap compute on YARN, the
> >>>>> longer the lead time between the community's announcement and the
> >>>>> actual release with no support, the better. Making these types of
> >>>>> changes can be a slow process for a slow-to-maneuver corporate
> >>>>> cruise ship.
> >>>>>
> >>>>> On Thu, Dec 12, 2024 at 9:46 AM Lucas Capistrant
> >>>>> <capistrant.lu...@gmail.com> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Dec 11, 2024 at 9:10 PM Karan Kumar <ka...@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> +1 for removal of Hadoop-based ingestion. It's a maintenance
> >>>>>>> overhead and stops us from moving to Java 17.
> >>>>>>> I am not aware of any gaps in SQL-based ingestion that prevent
> >>>>>>> users from moving off Hadoop. If there are any, please feel free
> >>>>>>> to reach out via Slack/GitHub.
> >>>>>>>
> >>>>>>> On Thu, Dec 12, 2024 at 3:22 AM Clint Wylie <cwy...@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey everyone,
> >>>>>>>>
> >>>>>>>> It is about that time again to take a pulse on how commonly
> >>>>>>>> Hadoop-based ingestion is used with Druid, in order to
> >>>>>>>> determine whether we should keep supporting it going forward.
> >>>>>>>>
> >>>>>>>> In my view, Hadoop-based ingestion has unofficially been on
> >>>>>>>> life support for quite some time, as we do not really go out of
> >>>>>>>> our way to add new features to it, and we perform very minimal
> >>>>>>>> testing to ensure everything keeps working. The most recent
> >>>>>>>> changes to it that I am aware of were to bump versions and
> >>>>>>>> require Hadoop 3, but that was primarily motivated by the
> >>>>>>>> selfish reason of wanting to use its contained client library
> >>>>>>>> and better isolation, so that we could free up our own
> >>>>>>>> dependencies to be updated. This thread is motivated by a
> >>>>>>>> similar reason; see the other thread I started recently
> >>>>>>>> discussing dropping support for Java 11, where Hadoop does not
> >>>>>>>> yet support the Java 17 runtime, so the outcome of this
> >>>>>>>> discussion is involved in those plans.
> >>>>>>>>
> >>>>>>>> I think SQL-based ingestion with the multi-stage query engine
> >>>>>>>> is the future of batch ingestion, and the Kubernetes-based task
> >>>>>>>> runner provides an alternative for task auto-scaling
> >>>>>>>> capabilities. Because of this, I don't personally see a lot of
> >>>>>>>> compelling reasons to keep supporting Hadoop, so I would be in
> >>>>>>>> favor of just dropping support for it completely, though I see
> >>>>>>>> no harm in keeping HDFS deep storage around. In past
> >>>>>>>> discussions I think we had tied Hadoop removal to adding
> >>>>>>>> something like Spark to replace it, but I wonder if that still
> >>>>>>>> needs to be the case.
> >>>>>>>>
> >>>>>>>> I do know from previous dev list discussions on this topic that
> >>>>>>>> there have classically been quite a lot of large Druid clusters
> >>>>>>>> in the wild still relying on Hadoop, so I wanted to check
> >>>>>>>> whether this is still true, and if so, whether any of these
> >>>>>>>> clusters have plans to transition to newer ways of ingesting
> >>>>>>>> data, like SQL-based ingestion. While from a dev/maintenance
> >>>>>>>> perspective it would be best to just drop it completely, if
> >>>>>>>> there is still a large user base I think we need to be open to
> >>>>>>>> keeping it around a while longer. If we do need to keep it,
> >>>>>>>> maybe it would be worth investing some time in moving it into a
> >>>>>>>> contrib extension, so that it isn't bundled by default with
> >>>>>>>> Druid releases, to discourage new adoption and more accurately
> >>>>>>>> reflect its current status in Druid.
> >>>>>>>>
> >>>>>>>> ---------------------------------------------------------------------
> >>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> >>>>>>>> For additional commands, e-mail: dev-h...@druid.apache.org
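[Editor's note: if Hadoop ingestion did move to a contrib extension as Clint suggests, operators who still need it would pull it in explicitly the way other contrib extensions are loaded today, via `druid.extensions.loadList`. The sketch below assumes a hypothetical extension name; `druid-hdfs-storage` is the existing core HDFS deep storage extension, which the thread proposes to keep.]

```properties
# common.runtime.properties - illustrative sketch only.
# "druid-hadoop-ingestion-contrib" is a hypothetical name for the
# relocated contrib extension; HDFS deep storage stays a core extension.
druid.extensions.loadList=["druid-hdfs-storage", "druid-hadoop-ingestion-contrib"]
```

Contrib extensions are not bundled in the release tarball, so users would also fetch the jar separately (e.g. with Druid's pull-deps tool) before adding it to the load list, which matches the goal of discouraging new adoption by default.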