[ 
https://issues.apache.org/jira/browse/KAFKA-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051494#comment-18051494
 ] 

David Jacot commented on KAFKA-19235:
-------------------------------------

Hi [~twmb]. Thanks for reporting this issue; we have noticed the same problem. 
It is essentially a flaw in the design of KIP-848: it makes client-side 
reasoning too hard. To improve this, we have proposed 
[KIP-1251|https://cwiki.apache.org/confluence/display/KAFKA/KIP-1251%3A+Assignment+epochs+for+consumer+groups].
The KIP will essentially avoid returning STALE_MEMBER_EPOCH. Please take a look 
at the KIP and let us know what you think. We plan to ship it in 4.3.

> STALE_MEMBER_EPOCH is mostly non-recoverable and forces lost commits when 
> leaving a group (KIP-848)
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19235
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19235
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 4.0.0
>            Reporter: Travis Bischel
>            Priority: Major
>
> Flow:
> * I heartbeat and receive memberEpoch 7, heartbeat interval 5s
> * 3s later I want to leave the group
> * In my OnRevoke before leaving, I commit offsets
> * The broker has bumped the memberEpoch
> * My OffsetCommit request fails with STALE_MEMBER_EPOCH
>
> Because I am leaving the group, there will be no future heartbeat (besides 
> the one actually leaving the group with memberEpoch -1 or -2) from which to 
> get a new epoch so that I can issue a final commit.
>
> What I've tried to do locally is force an inline ConsumerGroupHeartbeat if I 
> receive STALE_MEMBER_EPOCH from an OffsetCommit response and then reissue the 
> commit request. Well, Kafka 4 returns FENCED_MEMBER_EPOCH _a lot_, and 
> frequently this forced ConsumerGroupHeartbeat receives FENCED_MEMBER_EPOCH, 
> and thus I cannot update the epoch.
>  
> Clients are meant to give up all partitions if they experience 
> FENCED_MEMBER_EPOCH and rejoin with a MemberEpoch of 0. Well, we're already 
> in the process of giving up partitions. The commit just can't go through.
>  
> The Java client appears to blindly retry the commit without doing anything 
> with the epoch. The epoch is likely handled elsewhere; if so, and unless 
> something shows me otherwise, the Java client should also be hitting the 
> FENCED_MEMBER_EPOCH problem:
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/main/java/org/apache/kafka/clients/consumer/internals/CommitRequestManager.java#L346-L352]
> There are some tests in the Java client codebase, but they do not actually 
> test if the commit is successful. The tests simply check that the commit is 
> scheduled to be retried:
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/test/java/org/apache/kafka/clients/consumer/internals/CommitRequestManagerTest.java#L481-L485]
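
For readers following the quoted flow, here is a minimal sketch of the race 
in the description. It is a toy model, not broker or client code: 
ToyCoordinator, Result, and leaveGroupFlow are hypothetical names, and the 
rule that any outdated epoch is rejected is a simplifying assumption (the 
report says the forced heartbeat is fenced frequently, not always).

```java
// Toy model only: the class, enum, and method names below are hypothetical
// and do not mirror the broker's or any client's real code.
import java.util.ArrayList;
import java.util.List;

class StaleEpochSketch {
    enum Result { NONE, STALE_MEMBER_EPOCH, FENCED_MEMBER_EPOCH }

    /** Stand-in for the group coordinator; it tracks a single member's epoch. */
    static class ToyCoordinator {
        private int memberEpoch;

        ToyCoordinator(int memberEpoch) { this.memberEpoch = memberEpoch; }

        /** Simulates the broker bumping the epoch between two client heartbeats. */
        void bumpEpoch() { memberEpoch++; }

        /** Assumption: a commit carrying any epoch other than the current one is rejected. */
        Result offsetCommit(int clientEpoch) {
            return clientEpoch == memberEpoch ? Result.NONE : Result.STALE_MEMBER_EPOCH;
        }

        /** Assumption: a heartbeat carrying an outdated epoch is fenced. */
        Result heartbeat(int clientEpoch) {
            return clientEpoch == memberEpoch ? Result.NONE : Result.FENCED_MEMBER_EPOCH;
        }
    }

    /** Replays the reported flow and returns the results the client observes, in order. */
    static List<Result> leaveGroupFlow(ToyCoordinator coordinator, int clientEpoch) {
        List<Result> seen = new ArrayList<>();
        Result commit = coordinator.offsetCommit(clientEpoch);
        seen.add(commit);
        if (commit == Result.STALE_MEMBER_EPOCH) {
            // Attempted workaround: force an inline heartbeat to refresh the epoch.
            Result heartbeat = coordinator.heartbeat(clientEpoch);
            seen.add(heartbeat);
            // If this is fenced, the client has no fresh epoch to retry the commit with.
        }
        return seen;
    }

    public static void main(String[] args) {
        ToyCoordinator coordinator = new ToyCoordinator(7); // client last saw epoch 7
        coordinator.bumpEpoch();                            // epoch changes before the next heartbeat
        System.out.println(leaveGroupFlow(coordinator, 7));
    }
}
```

In this model the final commit can never succeed once the epoch has moved, 
which is the dead end the report describes for a member that is already 
leaving the group.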



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
