[ https://issues.apache.org/jira/browse/BEAM-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132335#comment-17132335 ]
Beam JIRA Bot commented on BEAM-3025:
-------------------------------------
This issue is assigned but has not received an update in 30 days so it has been
labeled "stale-assigned". If you are still working on the issue, please give an
update and remove the label. If you are no longer working on the issue, please
unassign so someone else may work on it. In 7 days the issue will be
automatically unassigned.
> Caveats with KafkaIO exactly once support.
> ------------------------------------------
>
> Key: BEAM-3025
> URL: https://issues.apache.org/jira/browse/BEAM-3025
> Project: Beam
> Issue Type: Bug
> Components: io-java-kafka
> Affects Versions: Not applicable
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: P2
> Labels: stale-assigned
>
> BEAM-2720 adds support for exactly-once semantics in the KafkaIO sink. It is
> tested with Dataflow and seems to work well. It does come with a few caveats
> we should address over time (a usage sketch follows at the end of this
> description):
> * The implementation requires specific durability/checkpoint semantics
> across a GroupByKey transform.
> ** It requires a checkpoint across the GBK. Not all runners support this;
> in particular, Flink's distributed checkpointing does not provide the
> required semantics.
> ** The implementation includes a runtime check for compatibility.
> * Requires stateful DoFn support. Not all runners support this, and even
> those that do often have their own caveats. Stateful processing is part of
> the core Beam model, and over time it should be widely supported across
> runners (see the stateful DoFn sketch below).
> * User records go through extra shuffles. The implementation results
> in two extra shuffles on Dataflow. Some enhancements to the Beam API might
> reduce the number of shuffles.
> * It requires the user to specify a 'number of shards', which determines
> sink parallelism. The shard ids are also used to store some state in topic
> metadata on the Kafka servers. If the number of shards is larger than the
> number of partitions for the output topic, the behavior is not documented,
> though tests seem to work fine: I am able to store metadata for 100 shards
> even though the topic has just 10 partitions. We should probably file a JIRA
> for Kafka. Alternatively, we could limit the number of shards to the number
> of partitions (not required yet; see the partition-count sketch below).
> * The metadata mentioned above is kept only for 24 hours by default. I.e.,
> if a pipeline does not write anything for a day or is down for a day, it
> could lose crucial state stored with Kafka. An admin can configure this
> retention to be larger on the Kafka servers, but there is no way for a
> client to increase it for a specific topic (see the retention sketch below).
> Note that Kafka Streams faces the same issue.
> ** I asked about both of these Kafka issues on the user list:
> https://www.mail-archive.com/[email protected]/msg27624.html
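>
> For reference, a minimal usage sketch of the EOS sink. The broker address,
> topic, and sink group id below are made up, and a runner without the
> required GBK/checkpoint semantics will reject this at runtime (the check
> mentioned above):
> {code:java}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.kafka.KafkaIO;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.Create;
> import org.apache.beam.sdk.values.KV;
> import org.apache.kafka.common.serialization.LongSerializer;
> import org.apache.kafka.common.serialization.StringSerializer;
>
> public class EosSinkExample {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>     p.apply(Create.of(KV.of(1L, "hello"), KV.of(2L, "world")))
>      .apply(KafkaIO.<Long, String>write()
>          .withBootstrapServers("broker1:9092")       // made-up broker address
>          .withTopic("results")                       // made-up topic
>          .withKeySerializer(LongSerializer.class)
>          .withValueSerializer(StringSerializer.class)
>          // numShards fixes sink parallelism; sinkGroupId identifies the
>          // metadata the sink stores on the Kafka side across job restarts.
>          .withEOS(10, "beam-eos-example"));
>     p.run().waitUntilFinish();
>   }
> }
> {code}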
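>
> The stateful DoFn dependency is the regular Beam state API. A made-up
> example of the kind of per-key state involved (not the sink's actual
> implementation):
> {code:java}
> import org.apache.beam.sdk.state.StateSpec;
> import org.apache.beam.sdk.state.StateSpecs;
> import org.apache.beam.sdk.state.ValueState;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.values.KV;
>
> // Assigns an increasing sequence number per key, the kind of per-key
> // state the EOS sink needs a runner to support faithfully.
> class SequenceFn extends DoFn<KV<Integer, String>, String> {
>   @StateId("seq")
>   private final StateSpec<ValueState<Long>> seqSpec = StateSpecs.value();
>
>   @ProcessElement
>   public void process(ProcessContext c, @StateId("seq") ValueState<Long> seq) {
>     Long stored = seq.read();
>     long next = (stored == null) ? 0L : stored;
>     seq.write(next + 1);
>     c.output(next + ":" + c.element().getValue());
>   }
> }
> {code}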
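>
> Until the shards-vs-partitions semantics are clarified on the Kafka side, a
> pipeline author could check up front with the Kafka AdminClient. The broker
> address, topic, and shard count below are made up:
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.TopicDescription;
>
> public class ShardCountCheck {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "broker1:9092"); // made-up broker address
>     String topic = "results";                       // made-up topic
>     int numShards = 100;                            // value passed to withEOS()
>     try (AdminClient admin = AdminClient.create(props)) {
>       TopicDescription desc =
>           admin.describeTopics(Collections.singleton(topic)).all().get().get(topic);
>       int partitions = desc.partitions().size();
>       if (numShards > partitions) {
>         // Undocumented territory: warn (or fail) rather than rely on
>         // metadata for more shards than partitions being retained.
>         System.err.printf("numShards (%d) > partitions (%d) for %s%n",
>             numShards, partitions, topic);
>       }
>     }
>   }
> }
> {code}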
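>
> On the 24-hour expiry: my assumption, pending the user-list thread above,
> is that it is the broker-wide offsets.retention.minutes setting (default
> 1440 minutes). If so, an operator can at least inspect it; broker address
> and broker id "0" below are made up:
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.Config;
> import org.apache.kafka.common.config.ConfigResource;
>
> public class RetentionCheck {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "broker1:9092");          // made-up address
>     ConfigResource broker =
>         new ConfigResource(ConfigResource.Type.BROKER, "0"); // broker id assumed
>     try (AdminClient admin = AdminClient.create(props)) {
>       Config config = admin.describeConfigs(Collections.singleton(broker))
>           .all().get().get(broker);
>       // Broker-wide setting; there is no per-topic override a client can set.
>       System.out.println("offsets.retention.minutes = "
>           + config.get("offsets.retention.minutes").value());
>     }
>   }
> }
> {code}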
--
This message was sent by Atlassian Jira
(v8.3.4#803005)