This is a fair summary of the current state, but it is also where Beam can add very strong value and make big data great and smooth.
Instead of this replay feature, isn't checkpointing workable? In particular with SDF, no?

On Nov 16, 2017, 19:50, "Raghu Angadi" <[email protected]> wrote:

> The core issue here is that there is no explicit concept of 'checkpoint' in
> Beam (UnboundedSource has a method 'getCheckpointMark', but that refers to
> the checkpoint on the external source). Runners do checkpoint internally as
> an implementation detail. Flink's checkpoint model is entirely different
> from Dataflow's and Spark's.
>
> @StableReplay helps, but it does not explicitly talk about a checkpoint by
> design.
>
> If you are looking to achieve some guarantees with a sink/DoFn, I think it
> is better to start with the requirements. I worked on the exactly-once sink
> for Kafka (see KafkaIO.write().withEOS()), where we essentially reshard the
> elements and assign sequence numbers to elements within each shard.
> Duplicates in replays are avoided based on these sequence numbers. The DoFn
> state API is used to buffer out-of-order replays. This implementation
> strategy works in Dataflow but not in Flink, which has a horizontal
> checkpoint. KafkaIO checks for compatibility.
>
> On Wed, Nov 15, 2017 at 12:38 AM, Romain Manni-Bucau
> <[email protected]> wrote:
>
> > Hi guys,
> >
> > The subject is a bit provocative, but the topic is real and comes up
> > again and again with Beam usage: how can a DoFn handle some
> > "chunking"?
> >
> > The need is to be able to commit every N records, but with N not too big.
> >
> > The natural API for that in Beam is the bundle one, but bundles are not
> > reliable since they can be very small (Flink) - we can say it is "ok"
> > even if it has some perf impact - or too big (Spark does full size /
> > #workers).
> >
> > The workaround is what we see in the ES I/O: a maxSize which does an
> > eager flush. The issue is that the checkpoint is then not respected
> > and you can process the same records multiple times.
> >
> > Any plan to make this API reliable and controllable from a Beam point
> > of view (at least in a max manner)?
> >
> > Thanks,
> > Romain Manni-Bucau
> > @rmannibucau | Blog | Old Blog | Github | LinkedIn
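To make Raghu's exactly-once description concrete, here is a minimal plain-Java sketch of the per-shard sequence-number deduplication idea: elements are resharded, each shard assigns increasing sequence numbers, and a replayed element whose sequence number was already committed is dropped. This is an illustration of the technique only; the class and method names are hypothetical, not the actual KafkaIO implementation (which additionally uses the Beam state API to buffer out-of-order replays).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of sequence-number-based deduplication per shard,
// the idea behind the exactly-once Kafka sink described above.
class ShardDeduper {
    // Last committed sequence number per shard (stand-in for durable DoFn state).
    private final Map<Integer, Long> committed = new HashMap<>();

    /** Returns true if the element is new and should be written,
     *  false if it is a replayed duplicate. */
    boolean shouldWrite(int shard, long seqNo) {
        long last = committed.getOrDefault(shard, -1L);
        if (seqNo <= last) {
            return false; // already written in a previous attempt; drop it
        }
        committed.put(shard, seqNo);
        return true;
    }
}
```

Because the decision only compares sequence numbers against committed state, a replay after a failure re-presents the same (shard, seqNo) pairs and is filtered out, which is what gives the sink its effectively-once behavior.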
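The "maxSize eager flush" workaround Romain mentions for the ES I/O can be sketched like this: buffer elements inside the DoFn and flush every N records, independently of bundle boundaries. The class below is a hypothetical stand-in, not Beam code; `process` plays the role of `@ProcessElement` and `flush` the role of `@FinishBundle`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of eager chunked flushing: commit every maxSize records.
class ChunkedWriter<T> {
    private final int maxSize;
    private final Consumer<List<T>> flushFn; // e.g. a bulk request to the sink
    private final List<T> buffer = new ArrayList<>();

    ChunkedWriter(int maxSize, Consumer<List<T>> flushFn) {
        this.maxSize = maxSize;
        this.flushFn = flushFn;
    }

    // Analogue of @ProcessElement: buffer, and flush eagerly at maxSize.
    void process(T element) {
        buffer.add(element);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    // Analogue of @FinishBundle: flush whatever remains.
    void flush() {
        if (!buffer.isEmpty()) {
            flushFn.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```

This illustrates the problem raised in the thread: the flushes happen whenever the buffer fills, so they are not aligned with the runner's checkpoint, and a failure between a flush and the next checkpoint replays records that were already written to the sink.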
