[ https://issues.apache.org/jira/browse/KAFKA-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mickael Maison reassigned KAFKA-10048:
--------------------------------------

    Assignee: Andre Araujo

> Possible data gap for a consumer after a failover when using MM2
> ----------------------------------------------------------------
>
>                 Key: KAFKA-10048
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10048
>             Project: Kafka
>          Issue Type: Bug
>          Components: mirrormaker
>    Affects Versions: 2.5.0
>            Reporter: Andre Araujo
>            Assignee: Andre Araujo
>            Priority: Major
>
> I've been looking at some MM2 scenarios and identified a situation where
> consumers can miss some data in the event of a failover.
>
> When a consumer subscribes to a topic for the first time and commits offsets,
> the offsets for every existing partition of that topic will be saved to the
> cluster's {{__consumer_offsets}} topic. Even if a partition is completely
> empty, the offset {{0}} will still be saved for the consumer's consumer group.
>
> When MM2 is replicating the checkpoints to the remote cluster, though, it
> [ignores anything that has an offset equal to
> zero|https://github.com/apache/kafka/blob/856e36651203b03bf9a6df2f2d85a356644cbce3/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointTask.java#L135],
> replicating offsets only for partitions that contain data.
>
> This can lead to a gap in the data consumed by consumers in the following
> scenario:
> # Topic is created on the source cluster.
> # MM2 is configured to replicate the topic and consumer groups.
> # Producer starts to produce data to the source topic but for some reason
> some partitions do not get data initially, while others do (skewed keyed
> messages or bad luck).
> # Consumers start to consume data from that topic and their consumer groups'
> offsets are replicated to the target cluster, *but only for partitions that
> contain data*. The consumers are using the default setting
> auto.offset.reset = latest.
> # A consumer failover to the second cluster is performed (for whatever
> reason), and the offset translation steps are completed. The consumers are
> not restarted yet.
> # The producers continue to produce data to the source cluster topic and now
> produce data to the partitions that were empty before.
> # *After* the producers start producing data, consumers are started on the
> target cluster and start consuming.
> For the partitions that already had data before the failover, everything
> works fine. The consumer offsets will have been translated correctly and the
> consumers will start consuming from the correct position.
> For the partitions that were empty before the failover, though, any data
> written by the producers to those partitions *after the failover but before
> the consumers start* will be completely missed, since the consumers will jump
> straight to the latest offset when they start due to the lack of a zero
> offset stored locally on the target cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
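
The scenario described above can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions, not MM2's actual code: it uses no Kafka client classes, every class and method name is hypothetical, and offsets are assumed to be identical on both clusters so that no offset translation is needed. It compares the number of records a consumer would miss on a previously empty partition when zero offsets are skipped versus when they are replicated.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the failover scenario; all names are hypothetical.
class ZeroOffsetGapSketch {

    // Copies committed offsets (partition -> offset) to the target cluster.
    // With skipZero = true, offsets equal to zero are dropped, which mimics
    // the behaviour reported in this issue.
    static Map<Integer, Long> replicateCheckpoints(Map<Integer, Long> committed, boolean skipZero) {
        Map<Integer, Long> replicated = new HashMap<>();
        for (Map.Entry<Integer, Long> e : committed.entrySet()) {
            if (skipZero && e.getValue() == 0L) {
                continue; // partition had no data, so its offset is not replicated
            }
            replicated.put(e.getKey(), e.getValue());
        }
        return replicated;
    }

    // Start position on the target cluster: the committed offset if one exists,
    // otherwise the log end offset, as with auto.offset.reset = latest.
    static long startPosition(Map<Integer, Long> targetOffsets, int partition, long logEndOffset) {
        Long committed = targetOffsets.get(partition);
        return committed != null ? committed : logEndOffset;
    }

    // Number of records missed on the partition that was empty before the failover.
    static long missedAfterFailover(boolean skipZero) {
        Map<Integer, Long> committedOnSource = new HashMap<>();
        committedOnSource.put(0, 100L); // partition 0 had data, consumed up to offset 100
        committedOnSource.put(1, 0L);   // partition 1 was empty, offset 0 was committed

        Map<Integer, Long> onTarget = replicateCheckpoints(committedOnSource, skipZero);

        // After the failover, 5 records reach partition 1 before the consumers restart.
        long partition1LogEnd = 5L;
        long start = startPosition(onTarget, 1, partition1LogEnd);

        // The consumer had read nothing from partition 1, so records in [0, start) are skipped.
        return start;
    }

    public static void main(String[] args) {
        System.out.println("zero offsets skipped:    missed " + missedAfterFailover(true));
        System.out.println("zero offsets replicated: missed " + missedAfterFailover(false));
    }
}
```

In this model the consumer misses all 5 records when the zero offset is not replicated, and none when it is, which matches the gap described for partitions that were empty before the failover.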