[
https://issues.apache.org/jira/browse/BEAM-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Beam JIRA Bot updated BEAM-3025:
--------------------------------
Labels: stale-P2 (was: )
> Caveats with KafkaIO exactly-once support.
> ------------------------------------------
>
> Key: BEAM-3025
> URL: https://issues.apache.org/jira/browse/BEAM-3025
> Project: Beam
> Issue Type: Bug
> Components: io-java-kafka
> Affects Versions: Not applicable
> Reporter: Raghu Angadi
> Priority: P2
> Labels: stale-P2
>
> BEAM-2720 adds support for exactly-once semantics in the KafkaIO sink. It is
> tested with Dataflow and seems to work well, though it comes with a few
> caveats we should address over time:
> * The implementation requires specific durability/checkpoint semantics
> across a GroupByKey transform.
> ** It requires a checkpoint across the GBK. Not all runners support this;
> specifically, the horizontally distributed checkpoint in Flink does not work.
> ** The implementation includes a runtime check for compatibility.
> * It requires stateful DoFn support (see the state API sketch after this
> list). Not all runners support this, and even those that do often come with
> their own caveats. Stateful processing is part of the core Beam model, and
> over time it will be widely supported across the runners.
> * User records go through extra shuffles; the implementation results in 2
> extra shuffles on Dataflow. Some enhancements to the Beam API might reduce
> the number of shuffles.
> * It requires the user to specify a 'number of shards', which determines
> sink parallelism (see the sink sketch after this list). The shard ids are
> also used to store some state in topic metadata on the Kafka servers. If the
> number of shards is larger than the number of partitions of the output
> topic, the behavior is not documented, though tests seem to work fine; e.g.,
> I am able to store metadata for 100 shards even though the topic has just 10
> partitions. We should probably file a jira for Kafka. Alternatively, we
> could limit the number of shards to be fewer than the number of partitions
> (not required yet).
> * The metadata mentioned above is kept for only 24 hours by default; i.e.,
> if a pipeline does not write anything for a day or is down for a day, it
> could lose crucial state stored with Kafka. An admin can configure this
> retention to be larger on the Kafka servers, but there is no way for a
> client to increase it for a specific topic. Note that Kafka Streams faces
> the same issue.
> ** I asked about both of these Kafka issues on the user list:
> https://www.mail-archive.com/[email protected]/msg27624.html
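> For context, 'stateful DoFn' above refers to Beam's per-key state API, which
> the sink implementation depends on runners supporting. A minimal sketch of a
> stateful DoFn follows; the names and types are illustrative, not taken from
> the sink implementation:
>
>     import org.apache.beam.sdk.coders.VarLongCoder;
>     import org.apache.beam.sdk.state.StateSpec;
>     import org.apache.beam.sdk.state.StateSpecs;
>     import org.apache.beam.sdk.state.ValueState;
>     import org.apache.beam.sdk.transforms.DoFn;
>     import org.apache.beam.sdk.values.KV;
>
>     // Counts elements per key in a ValueState cell. A runner must persist
>     // this state across bundles and failures; that is the kind of support
>     // the exactly-once sink relies on.
>     class CountPerKeyFn extends DoFn<KV<String, String>, KV<String, Long>> {
>       @StateId("count")
>       private final StateSpec<ValueState<Long>> countSpec =
>           StateSpecs.value(VarLongCoder.of());
>
>       @ProcessElement
>       public void process(ProcessContext c,
>           @StateId("count") ValueState<Long> count) {
>         Long current = count.read();          // null on the first element
>         long n = (current == null ? 0L : current) + 1;
>         count.write(n);
>         c.output(KV.of(c.element().getKey(), n));
>       }
>     }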
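> And a minimal sketch of wiring up the exactly-once sink itself via
> withEOS(numShards, sinkGroupId); the brokers, topic, serializers, and shard
> count below are placeholders:
>
>     import org.apache.beam.sdk.io.kafka.KafkaIO;
>     import org.apache.kafka.common.serialization.LongSerializer;
>     import org.apache.kafka.common.serialization.StringSerializer;
>
>     // 'input' is assumed to be a PCollection<KV<Long, String>>.
>     input.apply(KafkaIO.<Long, String>write()
>         .withBootstrapServers("broker-1:9092")   // placeholder brokers
>         .withTopic("results")                    // placeholder topic
>         .withKeySerializer(LongSerializer.class)
>         .withValueSerializer(StringSerializer.class)
>         // withEOS() enables exactly-once semantics: numShards fixes the
>         // sink parallelism, and the sink group id names the small amount
>         // of metadata the sink stores on the Kafka servers (subject to
>         // the 24-hour retention caveat above).
>         .withEOS(10, "my-sink-group-id"));
>
> Keeping numShards at or below the number of partitions of the output topic
> avoids the undocumented metadata behavior described above.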
--
This message was sent by Atlassian Jira
(v8.3.4#803005)