[jira] [Commented] (KAFKA-19233) Members cannot rejoin with epoch=0 for KIP-848

Lianet Magrans (Jira) Wed, 21 May 2025 09:27:23 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953199#comment-17953199
 ]


Lianet Magrans commented on KAFKA-19233:
----------------------------------------

Hey Travis, could you double check if you're sending the assigned partitions 
when you're re-issuing the HB request? and if the client had no assignment yet 
you should be sending a empty list of partitions as a full HB (not null), 
please check and let me know. 

This is required so that the coordinator can let a member back in after a 
fencing it epoch (ex. on lost HB response). This coordinator check you were 
sharing:
https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535
 
Also note that it checks that the member comes with the previous epoch it had 
(expecting that the client may loose one HB response with a bumped epoch, then 
rejoin with the previous epoch).

The java client deals with this situation by ensuring :
- sends only 1 HB at a time (until response from broker or timeout/error on 
request)
- on retriable errors (ex. timeout on no reponse) it resends a full HB 
(including the assigned partitions, ex. empty list if first request got no 
response), so the check on the coordinator allows it back it

Let me know if this helps, and I will definitely clarify on the KIP Fencing 
section (we were also mixing on this jira 2 fencing scenarios: fenced member vs 
fenced epoch). This issue is about fenced epoch, will clarify it all on the 
doc.  


> Members cannot rejoin with epoch=0 for KIP-848
> ----------------------------------------------
>
>                 Key: KAFKA-19233
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19233
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Travis Bischel
>            Priority: Major
>         Attachments: logs1
>
>
> If a group is on generation > 1 and a member is fenced, the member cannot 
> rejoin until the broker expires the member from the group.
> KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH 
> error, the consumer abandon all its partitions and rejoins with the same 
> member id and the epoch 0.".
> However, the current implementation on the broker throws FENCED_MEMBER_EPOCH 
> if the client provided epoch, when not equal to the current epoch, is 
> anything other than the current epoch - 1.
> Specifically this line: 
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535]
> If the current epoch is 13, and I reset to epoch 0, the conditional always 
> throws FENCED_MEMBER_EPOCH.
> Attached are logs of this case, here is a sample of a single log line 
> demonstrating the problem:
> {code:java}
> 2025-05-02 15:23:09,304 
> [data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG 
> kafka.request.logger - Completed 
> request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
>  consumer group member has a smaller member epoch (0) than the one known by 
> the group coordinator (11). The member must abandon all its partitions and 
> rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
> {code}
> The logs show the broker continuously responding errcode 110 for 50s until, 
> I'm assuming, some condition boots the member from the group, such that the 
> next time the broker receives the request, the member is considered new and 
> the request is successful.
> The first heartbeat is duplicated; I noticed that Kafka replies with 
> FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm 
> trying to see if it's possible to work around that. As an aside, between the 
> fenced error happening {_}a lot{_}, this issue, and KAFKA-19222, I'm leaning 
> to not opt into KIP-848 by default until the broker implementation improves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KAFKA-19233) Members cannot rejoin with epoch=0 for KIP-848

Reply via email to