[ https://issues.apache.org/jira/browse/KAFKA-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Jacot resolved KAFKA-13790.
---------------------------------
Fix Version/s: 3.3.0
Reviewer: Jason Gustafson
Resolution: Fixed
> ReplicaManager should be robust to all partition updates from kraft metadata
> log
> --------------------------------------------------------------------------------
>
> Key: KAFKA-13790
> URL: https://issues.apache.org/jira/browse/KAFKA-13790
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: David Jacot
> Priority: Major
> Fix For: 3.3.0
>
>
> There are two ways that partition state can be updated in the zk world: one
> is through `LeaderAndIsr` requests and the other is through `AlterPartition`
> responses. All changes made to partition state result in new LeaderAndIsr
> requests, but replicas ignore them if the leader epoch is less than or
> equal to the currently known leader epoch. Basically it works like this (see
> the sketch after the list):
> * Changes made by the leader are done through AlterPartition requests. These
> changes bump the partition epoch (or zk version), but leave the leader epoch
> unchanged. The controller still sends LeaderAndIsr requests for these changes,
> but replicas ignore them because the leader epoch is unchanged; partition
> state is instead only updated when the AlterPartition response is received.
> * Changes initiated by the controller itself always result in a leader epoch
> bump. These changes are sent to replicas through LeaderAndIsr requests and
> are applied by them.
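>
> Purely as an illustration of the check above (not the actual broker code, all
> names are made up), the zk-world behavior is roughly:
>
>   object LeaderAndIsrCheckSketch {
>     final case class PartitionState(leaderEpoch: Int, partitionEpoch: Int, isr: Set[Int])
>
>     // Apply the incoming state only if the controller bumped the leader epoch.
>     def maybeApplyLeaderAndIsr(current: PartitionState, incoming: PartitionState): PartitionState =
>       if (incoming.leaderEpoch > current.leaderEpoch) incoming
>       else current
>
>     def main(args: Array[String]): Unit = {
>       val current = PartitionState(leaderEpoch = 5, partitionEpoch = 10, isr = Set(1, 2, 3))
>       // ISR-only change (AlterPartition style): same leader epoch, so it is
>       // ignored here and picked up from the AlterPartition response instead.
>       assert(maybeApplyLeaderAndIsr(current, PartitionState(5, 11, Set(1, 2))) == current)
>       // Controller-driven change with a leader epoch bump: applied.
>       val bumped = PartitionState(6, 12, Set(1, 2))
>       assert(maybeApplyLeaderAndIsr(current, bumped) == bumped)
>     }
>   }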
> The code in `kafka.server.ReplicaManager` and `kafka.cluster.Partition` is
> built on top of these assumptions. The logic in `makeLeader`, for example,
> assumes that the leader epoch has indeed been bumped: follower state gets
> reset and a new entry is written to the leader epoch cache unconditionally.
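>
> For illustration only (simplified, made-up names, not the real `makeLeader`),
> that assumption looks roughly like this: every invocation is treated as a
> genuine leadership change, so follower progress is wiped and a cache entry is
> appended unconditionally.
>
>   object MakeLeaderSketch {
>     import scala.collection.mutable
>
>     final case class EpochEntry(epoch: Int, startOffset: Long)
>
>     val leaderEpochCache = mutable.ArrayBuffer.empty[EpochEntry]
>     val followerEndOffsets = mutable.Map.empty[Int, Long]
>
>     // Assumes the leader epoch was bumped by the controller.
>     def makeLeader(leaderEpoch: Int, logEndOffset: Long, replicas: Set[Int]): Unit = {
>       followerEndOffsets.clear()
>       replicas.foreach(r => followerEndOffsets(r) = -1L) // unknown until the next fetch
>       leaderEpochCache += EpochEntry(leaderEpoch, logEndOffset)
>     }
>
>     def main(args: Array[String]): Unit = {
>       makeLeader(leaderEpoch = 5, logEndOffset = 100L, replicas = Set(1, 2, 3))
>       // Re-applying the same epoch (which the kraft metadata path can do, see
>       // below) wipes follower progress and duplicates the epoch 5 entry.
>       makeLeader(leaderEpoch = 5, logEndOffset = 120L, replicas = Set(1, 2, 3))
>       println(leaderEpochCache) // EpochEntry(5,100), EpochEntry(5,120)
>     }
>   }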
> In KRaft, we also have two paths to update partition state. One is
> AlterPartition, just like in the zk world. The second is updates received
> from the metadata log. These follow the same path as LeaderAndIsr requests
> for the most part, but a key difference is that all changes are sent down to
> `kafka.cluster.Partition`, even those that do not bump the leader epoch. This
> breaks the assumptions mentioned above in `makeLeader`, which could result in
> leader epoch cache inconsistency. Another side effect, on the follower side,
> is that replica fetchers for updated partitions get unnecessarily restarted.
> There may be other side effects as well.
> We need to either replicate the zookeeper-side behavior (ignore metadata log
> updates that do not bump the leader epoch) or make the logic robust to all
> updates, including those without a leader epoch bump.
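>
> The second option could look roughly like the sketch below (illustrative only,
> made-up names): every update from the metadata log is applied, but the
> epoch-sensitive work (resetting follower state, writing to the leader epoch
> cache, restarting fetchers) only happens when the leader epoch actually moved.
>
>   object RobustMetadataApplySketch {
>     import scala.collection.mutable
>
>     final case class PartitionState(leaderEpoch: Int, partitionEpoch: Int, isr: Set[Int])
>
>     final class Partition(var state: PartitionState) {
>       val leaderEpochCache = mutable.ArrayBuffer.empty[(Int, Long)] // (epoch, start offset)
>
>       // Returns true if the caller should also restart replica fetchers.
>       def applyMetadataUpdate(incoming: PartitionState, logEndOffset: Long): Boolean = {
>         val epochBumped = incoming.leaderEpoch > state.leaderEpoch
>         if (epochBumped) {
>           // Real leadership change: record the new epoch (the real code would
>           // also reset follower state here).
>           leaderEpochCache += ((incoming.leaderEpoch, logEndOffset))
>         }
>         // ISR / partition epoch changes are applied either way.
>         state = incoming
>         epochBumped
>       }
>     }
>
>     def main(args: Array[String]): Unit = {
>       val p = new Partition(PartitionState(5, 10, Set(1, 2, 3)))
>       // ISR shrink without a leader epoch bump: applied, but no epoch cache
>       // entry and no fetcher restart.
>       assert(!p.applyMetadataUpdate(PartitionState(5, 11, Set(1, 2)), logEndOffset = 100L))
>       // Leader change with an epoch bump: applied with the epoch-sensitive work.
>       assert(p.applyMetadataUpdate(PartitionState(6, 12, Set(1, 2)), logEndOffset = 120L))
>     }
>   }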
--
This message was sent by Atlassian Jira
(v8.20.7#820007)