[ https://issues.apache.org/jira/browse/BEAM-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132335#comment-17132335 ]
Beam JIRA Bot commented on BEAM-3025:
-------------------------------------
This issue is assigned but has not received an update in 30 days so it has been
labeled "stale-assigned". If you are still working on the issue, please give an
update and remove the label. If you are no longer working on the issue, please
unassign so someone else may work on it. In 7 days the issue will be
automatically unassigned.
> Caveats with KafkaIO exactly once support.
> ------------------------------------------
>
> Key: BEAM-3025
> URL: https://issues.apache.org/jira/browse/BEAM-3025
> Project: Beam
> Issue Type: Bug
> Components: io-java-kafka
> Affects Versions: Not applicable
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: P2
> Labels: stale-assigned
>
> BEAM-2720 adds support for exactly-once semantics in the KafkaIO sink. It is
> tested with Dataflow and seems to work well. It does come with a few caveats
> we should address over time (a usage sketch follows at the end of this
> description):
> * The implementation requires specific durability/checkpoint semantics
> across a GroupByKey transform.
> ** It requires a checkpoint across the GBK. Not all runners support this;
> in particular, Flink's distributed checkpointing does not provide the
> required semantics.
> ** The implementation includes a runtime check for compatibility.
> * Requires stateful DoFn support. Not all runners support this, and even
> those that do often have their own caveats. Stateful processing is part of
> the core Beam model, and over time it should be widely supported across
> runners (see the stateful DoFn sketch below).
> * User records go through extra shuffles. The implementation results
> in two extra shuffles on Dataflow. Some enhancements to the Beam API might
> reduce the number of shuffles.
> * It requires the user to specify a 'number of shards', which determines
> sink parallelism. The shard ids are also used to store some state in topic
> metadata on the Kafka servers. If the number of shards is larger than the
> number of partitions for the output topic, the behavior is not documented,
> though tests seem to work fine: I am able to store metadata for 100 shards
> even though the topic has just 10 partitions. We should probably file a JIRA
> for Kafka. Alternatively, we could limit the number of shards to the number
> of partitions (not required yet; see the partition-count sketch below).
> * The metadata mentioned above is kept only for 24 hours by default. I.e.,
> if a pipeline does not write anything for a day or is down for a day, it
> could lose crucial state stored with Kafka. An admin can configure this
> retention to be larger on the Kafka servers, but there is no way for a
> client to increase it for a specific topic (see the retention sketch below).
> Note that Kafka Streams faces the same issue.
> ** I asked about both of these Kafka issues on the user list:
> https://www.mail-archive.com/[email protected]/msg27624.html
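>
> For reference, a minimal usage sketch of the EOS sink. The broker address,
> topic, and sink group id below are made up, and a runner without the
> required GBK/checkpoint semantics will reject this at runtime (the check
> mentioned above):
> {code:java}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.kafka.KafkaIO;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.Create;
> import org.apache.beam.sdk.values.KV;
> import org.apache.kafka.common.serialization.LongSerializer;
> import org.apache.kafka.common.serialization.StringSerializer;
>
> public class EosSinkExample {
>   public static void main(String[] args) {
>     Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>     p.apply(Create.of(KV.of(1L, "hello"), KV.of(2L, "world")))
>      .apply(KafkaIO.<Long, String>write()
>          .withBootstrapServers("broker1:9092")       // made-up broker address
>          .withTopic("results")                       // made-up topic
>          .withKeySerializer(LongSerializer.class)
>          .withValueSerializer(StringSerializer.class)
>          // numShards fixes sink parallelism; sinkGroupId identifies the
>          // metadata the sink stores on the Kafka side across job restarts.
>          .withEOS(10, "beam-eos-example"));
>     p.run().waitUntilFinish();
>   }
> }
> {code}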
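>
> The stateful DoFn dependency is the regular Beam state API. A made-up
> example of the kind of per-key state involved (not the sink's actual
> implementation):
> {code:java}
> import org.apache.beam.sdk.state.StateSpec;
> import org.apache.beam.sdk.state.StateSpecs;
> import org.apache.beam.sdk.state.ValueState;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.values.KV;
>
> // Assigns an increasing sequence number per key, the kind of per-key
> // state the EOS sink needs a runner to support faithfully.
> class SequenceFn extends DoFn<KV<Integer, String>, String> {
>   @StateId("seq")
>   private final StateSpec<ValueState<Long>> seqSpec = StateSpecs.value();
>
>   @ProcessElement
>   public void process(ProcessContext c, @StateId("seq") ValueState<Long> seq) {
>     Long stored = seq.read();
>     long next = (stored == null) ? 0L : stored;
>     seq.write(next + 1);
>     c.output(next + ":" + c.element().getValue());
>   }
> }
> {code}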
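>
> Until the shards-vs-partitions semantics are clarified on the Kafka side, a
> pipeline author could check up front with the Kafka AdminClient. The broker
> address, topic, and shard count below are made up:
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.TopicDescription;
>
> public class ShardCountCheck {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "broker1:9092"); // made-up broker address
>     String topic = "results";                       // made-up topic
>     int numShards = 100;                            // value passed to withEOS()
>     try (AdminClient admin = AdminClient.create(props)) {
>       TopicDescription desc =
>           admin.describeTopics(Collections.singleton(topic)).all().get().get(topic);
>       int partitions = desc.partitions().size();
>       if (numShards > partitions) {
>         // Undocumented territory: warn (or fail) rather than rely on
>         // metadata for more shards than partitions being retained.
>         System.err.printf("numShards (%d) > partitions (%d) for %s%n",
>             numShards, partitions, topic);
>       }
>     }
>   }
> }
> {code}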
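>
> On the 24-hour expiry: my assumption, pending the user-list thread above,
> is that it is the broker-wide offsets.retention.minutes setting (default
> 1440 minutes). If so, an operator can at least inspect it; broker address
> and broker id "0" below are made up:
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.AdminClient;
> import org.apache.kafka.clients.admin.Config;
> import org.apache.kafka.common.config.ConfigResource;
>
> public class RetentionCheck {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "broker1:9092");          // made-up address
>     ConfigResource broker =
>         new ConfigResource(ConfigResource.Type.BROKER, "0"); // broker id assumed
>     try (AdminClient admin = AdminClient.create(props)) {
>       Config config = admin.describeConfigs(Collections.singleton(broker))
>           .all().get().get(broker);
>       // Broker-wide setting; there is no per-topic override a client can set.
>       System.out.println("offsets.retention.minutes = "
>           + config.get("offsets.retention.minutes").value());
>     }
>   }
> }
> {code}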
--
This message was sent by Atlassian Jira
(v8.3.4#803005)