Tanu,

I'm assuming you're talking about reading from multiple Kafka partitions in a
single Spark Streaming job. In that case, your job can read from multiple
partitions, but in the end the data is written to a single table. The
Dataset/RDD resulting from reading the multiple partitions is passed as a
whole to the Hudi writer, and Spark's parallelism ensures you don't lose the
Kafka partition parallelism. In this setup there are no "multi-writers" to
the Hudi table. Is your setup different from the one I described?
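
For illustration, here is a rough sketch of the kind of job I'm describing,
assuming Structured Streaming with foreachBatch. The broker, topic, table
name, path, and field names below are placeholders, not taken from your job:

// Single Spark Structured Streaming job: one source reads every Kafka
// partition of the topic, and each micro-batch is written to one Hudi
// table through a single writer.
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("kafka-to-hudi")   // placeholder app name
  .getOrCreate()

// One streaming source; Spark spreads the topic's partitions across
// executors, so partition-level parallelism is preserved on read.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  // parse `json` into the columns of your schema here

kafkaStream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // The whole micro-batch (records from all Kafka partitions) goes
    // through one Hudi upsert, so there is a single writer per commit.
    // With upsert, records sharing a record key are deduplicated using
    // the precombine field.
    batch.write
      .format("hudi")
      .option("hoodie.table.name", "events")                               // placeholder
      .option("hoodie.datasource.write.recordkey.field", "event_id")       // placeholder
      .option("hoodie.datasource.write.precombine.field", "event_ts")      // placeholder
      .option("hoodie.datasource.write.partitionpath.field", "event_date") // placeholder
      .option("hoodie.datasource.write.operation", "upsert")
      .mode("append")
      .save("s3://bucket/hudi/events")                                     // placeholder
  }
  .option("checkpointLocation", "s3://bucket/checkpoints/events")          // placeholder
  .start()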

-Nishith

On Wed, Sep 16, 2020 at 9:50 AM tanu dua <tanu.dua...@gmail.com> wrote:

> Hi,
> I need to try this myself some more, but how does Hudi concurrent
> ingestion work with Spark Streaming?
> We have multiple Kafka partitions that Spark is listening on, so there
> is a possibility that at any given point in time multiple executors will
> be reading the Kafka partitions and start ingesting data. What behaviour
> can I expect from Hudi? It's possible that they may be writing to the
> same Hudi partition.
>
> Would both writes succeed? Would one overwrite the other if both have
> the same primary key?
>
