[ 
https://issues.apache.org/jira/browse/BEAM-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Beam JIRA Bot updated BEAM-3025:
--------------------------------
    Labels: stale-P2  (was: )

> Caveats with KafkaIO exactly once support.
> ------------------------------------------
>
>                 Key: BEAM-3025
>                 URL: https://issues.apache.org/jira/browse/BEAM-3025
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-kafka
>    Affects Versions: Not applicable
>            Reporter: Raghu Angadi
>            Priority: P2
>              Labels: stale-P2
>
> BEAM-2720 adds support for exactly-once semantics in the KafkaIO sink. It is 
> tested with Dataflow and seems to work well. It does come with a few caveats 
> we should address over time:
>   * The implementation requires specific durability/checkpoint semantics 
> across a GroupByKey transform.
>      ** It requires a checkpoint across the GBK. Not all runners support 
> this; specifically, it does not work with Flink's distributed checkpointing.
>      ** The implementation includes a runtime check for compatibility.
>   * It requires stateful DoFn support. Not all runners support this, and 
> even those that do often have their own caveats. Stateful DoFn is part of the 
> core Beam model, and over time it will be widely supported across runners.
>   * User records go through extra shuffles. The implementation results in 
> two extra shuffles in Dataflow. Some enhancements to the Beam API might 
> reduce the number of shuffles.
>   * It requires the user to specify a 'number of shards', which determines 
> sink parallelism. The shard ids are also used to store some state in topic 
> metadata on the Kafka servers. If the number of shards is larger than the 
> number of partitions for the output topic, the behavior is not documented, 
> though tests seem to work fine; i.e., I am able to store metadata for 100 
> partitions even though the topic has just 10 partitions. We should probably 
> file a JIRA for Kafka. Alternatively, we could limit the number of shards to 
> be fewer than the number of partitions (not required yet).
>   * The metadata mentioned above is kept for only 24 hours by default; i.e., 
> if a pipeline does not write anything for a day or is down for a day, it 
> could lose crucial state stored with Kafka. An admin can configure the Kafka 
> servers to keep it longer, but there is no way for a client to increase it 
> for a specific topic. Note that Kafka Streams faces the same issue.
>      ** I asked about both of these Kafka issues on the user list: 
> https://www.mail-archive.com/[email protected]/msg27624.html
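As a rough illustration of the caveats above, here is a minimal sketch of configuring the KafkaIO exactly-once sink. The `withEOS(numShards, sinkGroupId)` call is the real KafkaIO API; the broker address, topic name, and shard/group values are placeholders, and keeping the shard count at or below the output topic's partition count sidesteps the undocumented metadata behavior described above.

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceSinkSketch {
  // Sketch only: assumes a runner that supports the required GBK
  // checkpoint semantics and stateful DoFn (e.g. Dataflow).
  static void writeExactlyOnce(PCollection<KV<Long, String>> records) {
    records.apply(KafkaIO.<Long, String>write()
        .withBootstrapServers("broker-1:9092")   // placeholder
        .withTopic("output-topic")               // placeholder
        .withKeySerializer(LongSerializer.class)
        .withValueSerializer(StringSerializer.class)
        // numShards fixes sink parallelism; the shard ids are also used
        // for the sequence metadata stored on the Kafka servers, so
        // prefer numShards <= partition count of the output topic.
        .withEOS(10, "my-sink-group-id"));
  }
}
```

On the 24-hour retention caveat: assuming the metadata is held in consumer-group offset metadata (an assumption, not stated in the issue), an admin could raise the broker-side retention, e.g. `offsets.retention.minutes=10080` in `server.properties`, but as noted there is no per-topic, client-side override.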



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
