[
https://issues.apache.org/jira/browse/BEAM-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15556749#comment-15556749
]
Raghu Angadi commented on BEAM-704:
-----------------------------------
https://github.com/apache/incubator-beam/pull/1071 : sets offset on the reader.
{quote} A read from Kafka requires to specify topic/s and either specific
partitions or "earliest/latest". {quote}
That is not true. It does not require either 'earliest' or 'latest'; 'latest'
is the default. You can also set a consumer-group id, in which case it defaults
to the offset committed for that consumer group.
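The start-position rules described here can be sketched as follows. This is a minimal, self-contained simulation, not KafkaIO or Kafka client code; the method and parameter names are illustrative. A committed offset for the consumer group wins; otherwise the `auto.offset.reset` policy applies, and Kafka's default policy is "latest".

```java
import java.util.Optional;

// Hypothetical sketch of consumer start-position resolution (illustrative
// names, not the KafkaIO API): a committed offset for the consumer group
// wins; otherwise the auto.offset.reset policy applies, defaulting to latest.
class StartOffsetResolution {
    static long resolveStart(Optional<Long> committedForGroup,
                             String autoOffsetReset,
                             long earliestAvailable,
                             long latestAvailable) {
        if (committedForGroup.isPresent()) {
            // A consumer-group id with a committed offset resumes from the commit.
            return committedForGroup.get();
        }
        // No commit (or no group id): fall back to the reset policy.
        return "earliest".equals(autoOffsetReset) ? earliestAvailable : latestAvailable;
    }

    public static void main(String[] args) {
        // No group id, default policy: start at the latest available offset.
        System.out.println(resolveStart(Optional.empty(), "latest", 0L, 500L));   // prints 500
        // Group id with a committed offset: the commit wins over the policy.
        System.out.println(resolveStart(Optional.of(120L), "latest", 0L, 500L));  // prints 120
    }
}
```

This is why neither 'earliest' nor 'latest' needs to be specified explicitly: the policy only matters when there is nothing committed for the group.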
{quote} If we were to handle that on splitting, all Kafka reads would have a
"starting" CheckpointMark {quote}
That is not correct. AFAIK, Beam does not ask the reader for a checkpoint at
splitting time (at least Google Dataflow does not). getCheckpointMark() is only
called on the reader on the worker.
> KafkaIO should handle "latest offset" evenly, and persist it as part of the
> CheckpointMark.
> -------------------------------------------------------------------------------------------
>
> Key: BEAM-704
> URL: https://issues.apache.org/jira/browse/BEAM-704
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-extensions
> Reporter: Amit Sela
> Assignee: Raghu Angadi
>
> Currently, the KafkaIO (when configured to "latest") will check the latest
> offset on the worker. This means that each worker sees a "different" latest,
> depending on when it checks the partitions assigned to it.
> This also means that if a worker fails before starting to read, and new
> messages were added in between, they would be missed.
> I think we should consider checking the offsets (could be the same for
> "earliest") when running initialSplits (that's how Spark does it as well:
> one call from the driver for all topic-partitions).
> I'd also suggest we persist the latest offset as part of the CheckpointMark,
> so that once latest is resolved it is remembered until new messages arrive
> and doesn't need to be resolved again (and if new messages became available,
> they won't be missed upon failure).
> For Spark this is even more important as state is passed in-between
> micro-batches and sparse partitions may skip messages until a message finally
> arrives within the read time-frame.
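The split-time resolution proposed in the quoted description could be sketched like this. It is a hypothetical, self-contained simulation, not actual KafkaIO or Beam code; the class and method names are invented for illustration. The idea: resolve "latest" once on the driver, and seed each split with a starting checkpoint so all workers agree on the offsets and a pre-read worker failure cannot skip messages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (not KafkaIO code): resolve "latest" once at split
// time and seed each split with a starting checkpoint mark, so every
// worker starts from the same offsets.
class SplitTimeOffsets {
    // Stand-in for a per-partition CheckpointMark.
    static class Mark {
        final int partition;
        final long startOffset;
        Mark(int partition, long startOffset) {
            this.partition = partition;
            this.startOffset = startOffset;
        }
    }

    // initialSplits-style step: one driver-side lookup covering all partitions.
    static List<Mark> seedSplits(Map<Integer, Long> latestOffsets) {
        List<Mark> marks = new ArrayList<>();
        // TreeMap keeps partition order deterministic for the example.
        new TreeMap<>(latestOffsets).forEach((p, off) -> marks.add(new Mark(p, off)));
        return marks;
    }

    public static void main(String[] args) {
        Map<Integer, Long> latest = new TreeMap<>();
        latest.put(0, 500L);
        latest.put(1, 120L);
        for (Mark m : seedSplits(latest)) {
            System.out.println("partition " + m.partition + " starts at " + m.startOffset);
        }
    }
}
```

Because each split carries its starting offset in the checkpoint from the outset, a worker that fails before reading anything restarts from the same captured offset instead of re-resolving "latest" and missing messages produced in between.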
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)