I think this has come up before.

+1 to the point Pratyaksh mentioned. I would like to add a few more:

- The schema could be fetched dynamically from a registry based on the
topic/dataset name. Solvable.
- The Hudi keys, partition fields and the other inputs you need for configuring
Hudi need to be standardized. Solvable using dataset-level overrides.
- You will get one RDD from Kafka with data for multiple topics. This now needs
to be forked into multiple datasets. We need to cache the Kafka RDD in
memory, otherwise we will recompute and re-read the input from Kafka every time.
Expensive, but solvable (see the sketch after this list).
- Finally, you will be writing different Parquet schemas to different
files, and if you are running with num_core > 2, also concurrently. At Uber,
we originally did that and it became an operational nightmare to isolate bad
topics from good ones. Pretty tricky!

In all, we could support this and call out these caveats well.

In terms of work,

- We can either introduce multi-source support to DeltaStreamer natively
(more involved design work is needed to specify how each input stream maps to
each output stream)
- (Or) we can write a new tool that wraps the current DeltaStreamer, uses
the Kafka topic regex to identify all topics that need to be ingested,
and creates one delta streamer per topic within a SINGLE Spark
application (rough sketch below).
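
For option (2), a rough sketch of what such a wrapper could look like. The
runDeltaStreamerForTopic() helper and the broker address are placeholders I am
assuming; the real thing would build a per-topic DeltaStreamer config (target
path, table name, schema provider, key/partition overrides) and call into the
existing code path, all inside one Spark application.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MultiTopicDeltaStreamerWrapper {

  public static void main(String[] args) {
    Pattern topicRegex = Pattern.compile(args[0]); // e.g. "db1\\..*"
    JavaSparkContext jssc = new JavaSparkContext(); // shared by all ingests

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    // Discover the topics that match the regex.
    List<String> topics;
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      topics = consumer.listTopics().keySet().stream()
          .filter(t -> topicRegex.matcher(t).matches())
          .collect(Collectors.toList());
    }

    // One DeltaStreamer ingest per topic, all inside this single Spark app.
    // (Could also run on a thread pool; sequential keeps the sketch simple.)
    for (String topic : topics) {
      runDeltaStreamerForTopic(topic, jssc);
    }
  }

  // Hypothetical: build the per-topic config and invoke the existing
  // DeltaStreamer against the shared Spark context.
  static void runDeltaStreamerForTopic(String topic, JavaSparkContext jssc) {
    // ... construct the per-topic DeltaStreamer config and sync ...
  }
}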


Any takers for this?  Should be a pretty cool project, doable in a week or
two.

/thanks/vinoth

On Tue, Oct 1, 2019 at 12:39 AM Pratyaksh Sharma <[email protected]>
wrote:

> Hi Gurudatt,
>
> With a minimal code change, you can subscribe to multiple Kafka topics
> using the KafkaOffsetGen.java class. I feel the bigger problem in this case is
> going to be managing multiple target schemas because we register
> ParquetWriter with a single target schema at a time. I would also like to
> know if we have a workaround for such a case.
>
> On Tue, Oct 1, 2019 at 12:33 PM Gurudatt Kulkarni <[email protected]>
> wrote:
>
> > Hi All,
> >
> > I have a use case where I need to pull multiple tables (say close to 100)
> > into Hadoop. Do we need to schedule 100 Hudi jobs to pull these tables?
> Can
> > there be a workaround where there is one Hudi Application pulling from
> > multiple Kafka topics? This will avoid creating multiple SparkSessions
> and
> > avoid the memory overhead that comes with it.
> >
> > Regards,
> > Gurudatt
> >
>
