Re: Should UnboundedSource provide a split identifier ?

Eugene Kirpichov Tue, 06 Sep 2016 16:05:23 -0700

Oh! Okay, looks like this is a part of the code I was unfamiliar with. I'd
like to know the answer too, in this case.
+Daniel Mills <mil...@google.com> can you comment ?


On Tue, Sep 6, 2016 at 3:32 PM Amit Sela <amitsel...@gmail.com> wrote:

> That is correct, as long as non of the Kafka topics "grow" another
> partition (which it could).
> In that case, some bundle will have to start reading from this partition as
> well, and since all other bundles already have a "previous checkpoint" it
> matters which checkpoint to relate to. I don't know how it's implemented in
> Dataflow, but in Spark I'm testing using mapWithState to store the
> checkpoints per split.
> It also seems that order does matter to the KafkIO:
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L636
>
> On Wed, Sep 7, 2016 at 1:24 AM Eugene Kirpichov
> <kirpic...@google.com.invalid> wrote:
>
> > Hi Amit,
> > Could you explain more about why you're saying the order of splits
> matters?
> > AFAIK the semantics of Read.Unbounded is "read from all of the splits in
> > parallel, checkpointing each of them independently", so their order
> > shouldn't matter.
> >
> > On Tue, Sep 6, 2016 at 3:17 PM Amit Sela <amitsel...@gmail.com> wrote:
> >
> > > UnboundedSources generate initial splits, which are basically splits of
> > > them selves - for example, if an UnboundedKafkaSource is set to read
> from
> > > topic T1 which is made of 2 partitions P1 and P2, it will (optimally)
> > split
> > > into two UnboundedKafkaSource, one per partition.
> > > During the lifecycle of the "reader" bundles, CheckpointMarks are used
> to
> > > checkpoint "last-read" and so readers may restart/resume. I'm assuming
> > this
> > > is how newly created partitions will automatically be added to readers.
> > >
> > > The problem is that it's all fine while there is only one topic (Kafka
> > > topic-partition pairs are ordered), but if reading from more then one
> > topic
> > > this may break:
> > > T1,P1
> > > T1,P2
> > > T1,P3
> > > T2,P1
> > > The order is not maintained and T2,P1 is 4th now.
> > >
> > > If splits (UnboundedSources) had an identifier, this could be avoided,
> > and
> > > checkpoints could be persisted accordingly.
> > > For example, an UnboundedKafkaSource could use the hash code of it's
> > > assigned topic-partition pairs.
> > >
> > > I don't know how relevant this is to other Sources, but I guess it is
> as
> > > long as they may grow their partitions dynamically (though I might be
> > > completely wrong...) and I don't see much of a downside.
> > >
> > > Thoughts ?
> > >
> > > This is something that troubles me now while working on Read.Unbounded,
> > and
> > > from a quick look I saw that the FlinkRunner expects "stable" splitting
> > as
> > > well.. How does Dataflow handle this ?
> > >
> > > Thanks,
> > > Amit
> > >
> >
>

Re: Should UnboundedSource provide a split identifier ?

Reply via email to