https://issues.apache.org/jira/browse/HUDI-288 tracks this....
On Tue, Oct 1, 2019 at 10:17 AM Vinoth Chandar <[email protected]> wrote:
>
> I think this has come up before.
>
> +1 to the point Pratyaksh mentioned. I would like to add a few more:
>
> - Schema could be fetched dynamically from a registry based on
>   topic/dataset name. Solvable.
> - The Hudi keys, partition fields, and the inputs you need for configuring
>   Hudi need to be standardized. Solvable using dataset-level overrides.
> - You will get one RDD from Kafka with data for multiple topics. This
>   needs to be forked out to multiple datasets. We need to cache the Kafka
>   RDD in memory, otherwise we will recompute and re-read the input from
>   Kafka every time. Expensive. Solvable.
> - Finally, you will be writing different parquet schemas to different
>   files and, if you are running with num_core > 2, also concurrently. At
>   Uber, we originally did that and it became an operational nightmare to
>   isolate bad topics from good ones. Pretty tricky!
>
> In all, we could support this and call out these caveats well.
>
> In terms of work,
>
> - We can either introduce multi-source support to DeltaStreamer natively
>   (more involved design work needed to specify how each input stream maps
>   to each output stream)
> - (Or) we can write a new tool that wraps the current DeltaStreamer, uses
>   the Kafka topic regex to identify all topics that need to be ingested,
>   and creates one DeltaStreamer per topic within a SINGLE Spark
>   application.
>
> Any takers for this? Should be a pretty cool project, doable in a week or
> two.
>
> /thanks/vinoth
>
> On Tue, Oct 1, 2019 at 12:39 AM Pratyaksh Sharma <[email protected]>
> wrote:
>
>> Hi Gurudatt,
>>
>> With a minimal code change, you can subscribe to multiple Kafka topics
>> using the KafkaOffsetGen.java class. I feel the bigger problem in this
>> case is going to be managing multiple target schemas, because we register
>> the ParquetWriter with a single target schema at a time. I would also
>> like to know if we have a workaround for such a case.
>>
>> On Tue, Oct 1, 2019 at 12:33 PM Gurudatt Kulkarni <[email protected]>
>> wrote:
>>
>> > Hi All,
>> >
>> > I have a use case where I need to pull multiple tables (say close to
>> > 100) into Hadoop. Do we need to schedule 100 Hudi jobs to pull these
>> > tables? Can there be a workaround where there is one Hudi application
>> > pulling from multiple Kafka topics? This would avoid creating multiple
>> > SparkSessions and the memory overhead that comes with them.
>> >
>> > Regards,
>> > Gurudatt
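
The caching caveat Vinoth raises can be illustrated with a minimal sketch, using plain Spark and the spark-streaming-kafka-0-10 API rather than DeltaStreamer itself: read one multi-topic batch from Kafka, persist it, then fork it per topic. Without the persist(), each per-topic write would re-read the input from Kafka. The offset ranges are hard-coded here and writeToHudi() is a hypothetical placeholder for the actual per-dataset Hudi write.

// Sketch of the "one RDD, many datasets" caveat: read the multi-topic batch
// once, persist it, then fork it per topic. Offsets are hard-coded and
// writeToHudi() is a hypothetical placeholder.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class CachedMultiTopicBatch {

  public static void main(String[] args) {
    JavaSparkContext jsc =
        new JavaSparkContext(new SparkConf().setAppName("multi-topic-batch"));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "multi-topic-ingest");

    List<String> topics = Arrays.asList("topic_a", "topic_b");

    // One offset range per topic-partition; real code would derive these
    // from checkpointed offsets rather than hard-coded values.
    OffsetRange[] ranges = {
        OffsetRange.create("topic_a", 0, 0, 1000),
        OffsetRange.create("topic_b", 0, 0, 1000)
    };

    JavaRDD<ConsumerRecord<String, String>> batch =
        KafkaUtils.<String, String>createRDD(
            jsc, kafkaParams, ranges, LocationStrategies.PreferConsistent());

    // Cache once so each per-topic fork below does not re-read from Kafka.
    batch.persist(StorageLevel.MEMORY_AND_DISK());

    for (String topic : topics) {
      JavaRDD<String> records =
          batch.filter(r -> r.topic().equals(topic)).map(ConsumerRecord::value);
      writeToHudi(topic, records);
    }

    batch.unpersist();
    jsc.stop();
  }

  // Hypothetical placeholder for the per-topic Hudi write (its own schema,
  // key generator and target base path).
  private static void writeToHudi(String topic, JavaRDD<String> records) {
    System.out.println(topic + ": " + records.count() + " records");
  }
}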

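A minimal sketch of the second option Vinoth describes (the wrapper tool) might look like the following: discover topics via a regex with the Kafka AdminClient and run one ingestion pipeline per topic inside a single Spark application. Only the AdminClient usage is concrete; MultiTopicIngestDriver and ingestTopic() are hypothetical placeholders for wiring up the real DeltaStreamer with per-topic configuration.

// Sketch of the wrapper-tool idea: regex-based topic discovery plus one
// ingestion pipeline per topic within a single application. ingestTopic()
// is a hypothetical placeholder for the real DeltaStreamer invocation.
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class MultiTopicIngestDriver {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    Pattern topicRegex = Pattern.compile(args.length > 0 ? args[0] : "ingest\\..*");

    // Discover all topics matching the configured regex.
    List<String> topics;
    try (AdminClient admin = AdminClient.create(props)) {
      topics = admin.listTopics().names().get().stream()
          .filter(t -> topicRegex.matcher(t).matches())
          .sorted()
          .collect(Collectors.toList());
    }

    // One ingestion per topic; a bounded pool limits concurrent writers, but
    // isolation is still per-thread, not per-JVM, so a bad topic can still
    // affect the shared application (the caveat called out in the thread).
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (String topic : topics) {
      pool.submit(() -> ingestTopic(topic));
    }
    pool.shutdown();
  }

  // Hypothetical hook: a real wrapper would configure and run a DeltaStreamer
  // instance here (target base path, key generator, schema source) derived
  // from the topic name plus dataset-level overrides.
  private static void ingestTopic(String topic) {
    System.out.println("Would ingest topic: " + topic);
  }
}

Note this keeps each topic's target schema and Hudi config separate, sidestepping the single-ParquetWriter-schema concern Pratyaksh mentions, at the cost of running many small pipelines inside one Spark application.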