Regarding BEAM-704: I sent a PR, https://github.com/apache/incubator-beam/pull/1071, which sets 'consumedOffset' in the reader without waiting for the first record to be read.
On Fri, Oct 7, 2016 at 5:28 PM, Dan Halperin <[email protected]> wrote:
> > 4. Reading/persisting Kafka start offsets - since Spark works in
> > micro-batches, if "latest" was applied on a fairly sparse topic, each
> > worker would actually begin reading only after it saw a message during
> > the time window it had to read messages. This is because fetching the
> > offsets is done by the worker running the Reader. This means that each
> > Reader sees a different state of "latest" (for its partition(s)), such
> > that a failing Reader that hasn't read yet might fetch a different
> > "latest" once it's recovered than what it originally fetched. While
> > this may not be as painful for other runners, IMHO it lacks correctness
> > and I'd suggest either reading Kafka metadata of the Kafka cluster once
> > upon initial splitting, or adding some of it to the CheckpointMark.
> > Filed BEAM-704 <https://issues.apache.org/jira/browse/BEAM-704>.
>
> +1. This is a great point. The notion that a runner may stop a reader and
> resume it from a checkpoint frequently is definitely part of the Beam
> model -- right now the Spark and Direct runners (at least) do it very
> often. The current behavior is definitely, if not broken... unexpected.
>
> Both proposed solutions make sense to me -- either log the last offset
> for all partitions during splitting, or simply log the previous offset in
> the checkpoint mark when we start reading for the first time.
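To make the race concrete, here is a small toy model (plain Python, not Beam's KafkaIO or the Kafka client API; all names here are made up for illustration). It contrasts a reader that re-resolves "latest" whenever it starts against one that resumes from an offset captured once, e.g. at initial splitting or in the CheckpointMark:

```python
# Toy model of one Kafka partition: offsets are just list indices.
class Partition:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def latest_offset(self):
        # What a broker would report as "latest" right now.
        return len(self.messages)


def start_offset_lazy(partition):
    """Resolve 'latest' only when the reader (re)starts: a reader that
    failed before reading and restarts later sees a *different* 'latest'
    and silently skips any messages that arrived in between."""
    return partition.latest_offset()


def start_offset_checkpointed(checkpoint):
    """Resume from the offset captured once up front; a restarted reader
    starts exactly where the original would have."""
    return checkpoint


p = Partition()
split_time_offset = p.latest_offset()   # "latest" captured once: offset 0

# Messages arrive while the (failed) reader is being recovered.
p.append("m1")
p.append("m2")

start_lazy = start_offset_lazy(p)                        # now offset 2
start_ckpt = start_offset_checkpointed(split_time_offset)  # still offset 0

print(start_lazy, start_ckpt)  # 2 0 -- lazy reader skips m1 and m2
```

Under this model, the lazy reader drops m1 and m2 on recovery, which is exactly the correctness gap described above; either of the two proposed fixes amounts to using the checkpointed variant.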
