[ https://issues.apache.org/jira/browse/KAFKA-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051494#comment-18051494 ]
David Jacot commented on KAFKA-19235:
-------------------------------------
Hi [~twmb]. Thanks for reporting this issue. We noticed the same problem. It is
a flaw in the design of KIP-848, in the sense that it makes client-side
reasoning too hard. To improve this, we have proposed
[KIP-1251|https://cwiki.apache.org/confluence/display/KAFKA/KIP-1251%3A+Assignment+epochs+for+consumer+groups].
The KIP essentially avoids returning STALE_MEMBER_EPOCH. Please take a look at
the KIP and let us know what you think. We plan to ship it in 4.3.
> STALE_MEMBER_EPOCH is mostly non-recoverable and forces lost commits when
> leaving a group (KIP-848)
> ---------------------------------------------------------------------------------------------------
>
> Key: KAFKA-19235
> URL: https://issues.apache.org/jira/browse/KAFKA-19235
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Affects Versions: 4.0.0
> Reporter: Travis Bischel
> Priority: Major
>
> Flow:
> * I heartbeat and receive memberEpoch 7, heartbeat interval 5s
> * 3s later I want to leave the group
> * In my OnRevoke before leaving, I commit offsets
> * The broker has bumped the memberEpoch
> * My OffsetCommit request fails with STALE_MEMBER_EPOCH
> Because I am leaving the group, there will be no future heartbeat (besides the
> one that actually leaves the group with memberEpoch -1 or -2) from which to get
> a new epoch, so I cannot issue a final commit.
> What I've tried to do locally is force an inline ConsumerGroupHeartbeat if I
> receive STALE_MEMBER_EPOCH from an OffsetCommit response and then reissue the
> commit request. Well, Kafka 4 returns FENCED_MEMBER_EPOCH _a lot_, and
> frequently this forced ConsumerGroupHeartbeat receives FENCED_MEMBER_EPOCH,
> and thus I cannot update the epoch.
>
> Clients are meant to give up all partitions if they experience
> FENCED_MEMBER_EPOCH and rejoin with a MemberEpoch of 0. Well, we're already
> in the process of giving up partitions. The commit just can't go through.
>
> The Java client appears to blindly retry the commit without doing anything
> with the epoch. The epoch is likely handled elsewhere, but unless something
> shows me otherwise, the Java client should hit the same FENCED_MEMBER_EPOCH
> problem there:
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/main/java/org/apache/kafka/clients/consumer/internals/CommitRequestManager.java#L346-L352]
> There are some tests in the Java client codebase, but they do not actually
> verify that the commit succeeds. They only check that the commit is
> scheduled to be retried:
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/test/java/org/apache/kafka/clients/consumer/internals/CommitRequestManagerTest.java#L481-L485]
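The workaround described in the report (force an inline ConsumerGroupHeartbeat
after STALE_MEMBER_EPOCH, then reissue the commit) can be illustrated with a toy
model. This is only a sketch in Python under stated assumptions: ToyCoordinator,
commit_before_leave, and the fence_heartbeats flag are hypothetical names, not
part of Kafka or any client, and the epoch check is simplified to an exact
match. It shows why the final commit is lost once the inline heartbeat is fenced.

```python
from dataclasses import dataclass, field
from enum import Enum


class Error(Enum):
    NONE = "NONE"
    STALE_MEMBER_EPOCH = "STALE_MEMBER_EPOCH"
    FENCED_MEMBER_EPOCH = "FENCED_MEMBER_EPOCH"


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for the group coordinator; not a Kafka API."""
    member_epoch: int = 7           # epoch the member last saw in a heartbeat
    fence_heartbeats: bool = False  # assumption: models a broker that fences
    committed: dict = field(default_factory=dict)

    def bump_epoch(self):
        # Models the broker advancing the member epoch between heartbeats.
        self.member_epoch += 1

    def offset_commit(self, epoch, offsets):
        # Simplification: any epoch mismatch is reported as stale.
        if epoch != self.member_epoch:
            return Error.STALE_MEMBER_EPOCH
        self.committed.update(offsets)
        return Error.NONE

    def heartbeat(self, epoch):
        # Returns (error, current_epoch); the epoch is None when fenced.
        if self.fence_heartbeats and epoch != self.member_epoch:
            return Error.FENCED_MEMBER_EPOCH, None
        return Error.NONE, self.member_epoch


def commit_before_leave(coord, epoch, offsets, max_attempts=2):
    """On STALE_MEMBER_EPOCH, heartbeat inline to refresh the epoch and retry."""
    for _ in range(max_attempts):
        err = coord.offset_commit(epoch, offsets)
        if err is not Error.STALE_MEMBER_EPOCH:
            return err
        hb_err, new_epoch = coord.heartbeat(epoch)
        if hb_err is not Error.NONE:
            # The new epoch cannot be learned, so the commit cannot go through.
            return hb_err
        epoch = new_epoch
    return Error.STALE_MEMBER_EPOCH


# Usage: the member holds epoch 7, the broker bumps it, then the member commits.
coord = ToyCoordinator(fence_heartbeats=True)
coord.bump_epoch()
result = commit_before_leave(coord, 7, {"topic-0": 42})
```

With fence_heartbeats left at False, the same call refreshes the epoch and the
retried commit goes through in this model; with it set to True, the function
returns FENCED_MEMBER_EPOCH and nothing is committed.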