Great! -Nishith
On Thu, Sep 17, 2020 at 10:28 AM tanu dua <tanu.dua...@gmail.com> wrote:

> Thank you so much Nishith. I understand now how it's going to work.
>
> On Wed, 16 Sep 2020 at 11:15 PM, nishith agarwal <n3.nas...@gmail.com> wrote:
>
> > Tanu,
> >
> > I'm assuming you're talking about reading multiple Kafka partitions from
> > a single Spark Streaming job. In this case, your job can read from
> > multiple partitions, but at the end this data should be written to a
> > single table. The dataset/RDD resulting from reading multiple partitions
> > is passed as a whole to the Hudi writer, and Spark parallelism ensures
> > you don't lose the Kafka partition parallelism. In this case, there are
> > no "multi-writers" to Hudi tables. Is your setup different from the one
> > I described?
> >
> > -Nishith
> >
> > On Wed, Sep 16, 2020 at 9:50 AM tanu dua <tanu.dua...@gmail.com> wrote:
> >
> > > Hi,
> > > I need to try this myself, but how does Hudi handle concurrent
> > > ingestion with Spark Streaming? We have multiple Kafka partitions that
> > > Spark is listening on, so at any given point in time multiple executors
> > > may be reading the Kafka partitions and ingesting data. What behaviour
> > > can I expect from Hudi? It's possible that they may be writing to the
> > > same Hudi partition.
> > >
> > > Would both writes be successful? Would one overwrite the other if both
> > > have the same primary key?
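For concreteness, below is a minimal Scala sketch of the single-writer pattern Nishith describes: one Structured Streaming query subscribes to the whole Kafka topic (so executors still read its partitions in parallel), and foreachBatch hands each micro-batch, already containing data from all partitions, to a single Hudi upsert. The broker address, topic, table name, record-key/precombine fields, and paths are hypothetical placeholders, not anything from this thread.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object KafkaToHudi {

  // One Hudi write per micro-batch: the batch already holds records from
  // ALL Kafka partitions, so there is a single writer per Hudi commit.
  def writeBatch(batch: DataFrame, batchId: Long): Unit = {
    batch.write
      .format("hudi")
      .option("hoodie.table.name", "events_table")                     // hypothetical table
      .option("hoodie.datasource.write.recordkey.field", "_row_key")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      .mode("append")
      .save("/data/hudi/events_table")                                 // hypothetical path
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hudi").getOrCreate()

    // A single streaming source reads the topic; Spark spreads the topic's
    // partitions across executors, preserving the Kafka read parallelism.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .load()
      .selectExpr(
        "CAST(key AS STRING) AS _row_key",
        "CAST(value AS STRING) AS payload",
        "timestamp AS ts")

    events.writeStream
      .foreachBatch(writeBatch _)
      .option("checkpointLocation", "/tmp/checkpoints/events")  // hypothetical
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()
      .awaitTermination()
  }
}

This also covers the "same primary key" question from the original mail: because each micro-batch goes through one upsert, Hudi uses the precombine field to keep one record per key within the batch, so duplicate keys are resolved inside a single commit rather than by two writers racing.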