Regarding BEAM-704: I sent a PR, https://github.com/apache/incubator-beam/pull/1071, which sets 'consumedOffset' in the reader without waiting for the first record to be read.
On Fri, Oct 7, 2016 at 5:28 PM, Dan Halperin <[email protected]> wrote:
> > 4. Reading/persisting Kafka start offsets - since Spark works in
> > micro-batches, if "latest" was applied on a fairly sparse topic, each
> > worker would actually begin reading only after it saw a message during
> > the time window it had to read messages. This is because fetching the
> > offsets is done by the worker running the Reader. This means that each
> > Reader sees a different state of "latest" (for its partition(s)), such
> > that a failing Reader that hasn't read yet might fetch a different
> > "latest" once it's recovered than what it originally fetched. While
> > this may not be as painful for other runners, IMHO it lacks correctness
> > and I'd suggest either reading Kafka metadata of the Kafka cluster once
> > upon initial splitting, or adding some of it to the CheckpointMark.
> > Filed BEAM-704 <https://issues.apache.org/jira/browse/BEAM-704>.
>
> +1. This is a great point. The notion that a runner may stop a reader and
> resume it from a checkpoint frequently is definitely part of the Beam
> model -- right now the Spark and Direct runners (at least) do it very
> often. The current behavior is definitely, if not broken... unexpected.
>
> Both proposed solutions make sense to me -- either log the last offset
> for all partitions during splitting, or simply log the previous offset in
> the checkpoint mark when we start reading for the first time.
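To make the race concrete, here is a small toy model (plain Python, not Beam's KafkaIO or the Kafka client API; all names here are made up for illustration). It contrasts a reader that re-resolves "latest" whenever it starts against one that resumes from an offset captured once, e.g. at initial splitting or in the CheckpointMark:

```python
# Toy model of one Kafka partition: offsets are just list indices.
class Partition:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def latest_offset(self):
        # What a broker would report as "latest" right now.
        return len(self.messages)


def start_offset_lazy(partition):
    """Resolve 'latest' only when the reader (re)starts: a reader that
    failed before reading and restarts later sees a *different* 'latest'
    and silently skips any messages that arrived in between."""
    return partition.latest_offset()


def start_offset_checkpointed(checkpoint):
    """Resume from the offset captured once up front; a restarted reader
    starts exactly where the original would have."""
    return checkpoint


p = Partition()
split_time_offset = p.latest_offset()   # "latest" captured once: offset 0

# Messages arrive while the (failed) reader is being recovered.
p.append("m1")
p.append("m2")

start_lazy = start_offset_lazy(p)                        # now offset 2
start_ckpt = start_offset_checkpointed(split_time_offset)  # still offset 0

print(start_lazy, start_ckpt)  # 2 0 -- lazy reader skips m1 and m2
```

Under this model, the lazy reader drops m1 and m2 on recovery, which is exactly the correctness gap described above; either of the two proposed fixes amounts to using the checkpointed variant.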
