[ https://issues.apache.org/jira/browse/KAFKA-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949009#comment-17949009 ]
Travis Bischel edited comment on KAFKA-19233 at 5/2/25 8:47 PM:
----------------------------------------------------------------

What has you thinking (1) did not happen? In the broker logs, line 2 shows the broker replying to the client with FENCED_MEMBER_EPOCH (errorCode 110).
 * Line 1: CONSUMER_GROUP_HEARTBEAT, member uxNP... epoch 10. The broker replies with a successful response. This is a full request (all fields present).
 * Line 2, simulating the client disconnected and lost the response: CONSUMER_GROUP_HEARTBEAT for member uxNP epoch 10. The broker replies errorCode 110, FENCED_MEMBER_EPOCH. This is a full request (all fields present).
 * Line 3, repeated full request again: CONSUMER_GROUP_HEARTBEAT uxNP e10. The broker replies FENCED_MEMBER_EPOCH.
 * Line 4, client resets the epoch to 0 and sends a full request – all owned topics are lost. CONSUMER_GROUP_HEARTBEAT uxNP e0. The broker replies FENCED_MEMBER_EPOCH. The client then repeatedly retries epoch 0 with full requests.


was (Author: twmb):
What has you thinking (1) did not happen? In the broker logs, line 2 shows the broker replying to the client with FENCED_MEMBER_EPOCH (errorCode 110).
 * Line 1: CONSUMER_GROUP_HEARTBEAT, member uxNP... epoch 10. The broker replies with a successful response.
 * Line 2, simulating the client disconnected and lost the response: CONSUMER_GROUP_HEARTBEAT for member uxNP epoch 10. The broker replies errorCode 110, FENCED_MEMBER_EPOCH.
 * Line 3, repeated request again: CONSUMER_GROUP_HEARTBEAT uxNP e10. The broker replies FENCED_MEMBER_EPOCH.
 * Line 4, client resets the epoch to 0. CONSUMER_GROUP_HEARTBEAT uxNP e0. The broker replies FENCED_MEMBER_EPOCH. The client then repeatedly retries epoch 0.

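The client-side reset described in line 4 of the comment (abandon everything, keep the member id, go back to epoch 0) can be sketched roughly as follows. This is a minimal illustration only, not the kgo client's code: the class `MemberStateSketch`, the enum `HeartbeatError`, the method `onHeartbeatResponse`, and the sample values `member-1` and `topic-0` are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a consumer group member's local state.
class MemberStateSketch {
    // Only the outcomes relevant to this discussion are modeled.
    enum HeartbeatError { NONE, FENCED_MEMBER_EPOCH, UNKNOWN_MEMBER_ID }

    final String memberId;              // kept across a rejoin
    int memberEpoch;                    // reset to 0 when fenced
    final List<String> ownedPartitions; // abandoned when fenced

    MemberStateSketch(String memberId, int memberEpoch, List<String> owned) {
        this.memberId = memberId;
        this.memberEpoch = memberEpoch;
        this.ownedPartitions = new ArrayList<>(owned);
    }

    // Applies a heartbeat response to the local state.
    void onHeartbeatResponse(HeartbeatError error, int responseEpoch) {
        switch (error) {
            case FENCED_MEMBER_EPOCH:
            case UNKNOWN_MEMBER_ID:
                // Abandon all partitions and rejoin with the same
                // member id and epoch 0.
                ownedPartitions.clear();
                memberEpoch = 0;
                break;
            case NONE:
                // Successful heartbeat: adopt the epoch from the response.
                memberEpoch = responseEpoch;
                break;
        }
    }

    public static void main(String[] args) {
        List<String> owned = new ArrayList<>();
        owned.add("topic-0"); // hypothetical partition name
        MemberStateSketch m = new MemberStateSketch("member-1", 10, owned);
        m.onHeartbeatResponse(HeartbeatError.FENCED_MEMBER_EPOCH, 0);
        System.out.println(m.memberId + " epoch=" + m.memberEpoch
                + " owned=" + m.ownedPartitions);
    }
}
```

The next heartbeat built from this state would carry the same member id, epoch 0, and an empty owned-partition list, which matches the shape of the request that the broker rejects in the logs.
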
> Members cannot rejoin with epoch=0 for KIP-848
> ----------------------------------------------
>
>                 Key: KAFKA-19233
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19233
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Travis Bischel
>            Priority: Major
>         Attachments: logs1
>
>
> If a group is on generation > 1 and a member is fenced, the member cannot
> rejoin until the broker expires the member from the group.
> KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH
> error, the consumer abandon all its partitions and rejoins with the same
> member id and the epoch 0.".
> However, the current implementation on the broker throws FENCED_MEMBER_EPOCH
> if the client-provided epoch is neither equal to the current epoch nor equal
> to the current epoch - 1.
> Specifically this line:
> https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535
> If the current epoch is 13, and I reset to epoch 0, the conditional always
> throws FENCED_MEMBER_EPOCH.
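
The conditional described above can be modeled with a short sketch. This is not the code in GroupMetadataManager.java; it only encodes the rule as the report states it (accept the current epoch or the current epoch - 1, fence everything else), and the real check may include further conditions that are not modeled here. The names `EpochCheckSketch`, `throwIfEpochInvalid`, `isFenced`, and the local `FencedMemberEpochException` class are hypothetical stand-ins; the exception corresponds to the error the logs report as errorCode 110.

```java
// Hypothetical model of the epoch check as described in this report.
class EpochCheckSketch {
    // Local stand-in for the broker's fenced-epoch error (errorCode 110).
    static class FencedMemberEpochException extends RuntimeException {
        FencedMemberEpochException(String message) {
            super(message);
        }
    }

    // Throws unless the received epoch is the current epoch or current - 1.
    static void throwIfEpochInvalid(int currentEpoch, int receivedEpoch) {
        if (receivedEpoch == currentEpoch) {
            return; // member is up to date
        }
        if (receivedEpoch == currentEpoch - 1) {
            return; // assumed tolerance for a lost response
        }
        // Every other value is fenced, including a rejoin at epoch 0.
        throw new FencedMemberEpochException("member epoch " + receivedEpoch
                + " does not match coordinator epoch " + currentEpoch);
    }

    // Convenience wrapper that reports the outcome as a boolean.
    static boolean isFenced(int currentEpoch, int receivedEpoch) {
        try {
            throwIfEpochInvalid(currentEpoch, receivedEpoch);
            return false;
        } catch (FencedMemberEpochException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isFenced(13, 13)); // false
        System.out.println(isFenced(13, 12)); // false
        System.out.println(isFenced(13, 0));  // true
    }
}
```

Under this rule a reset to epoch 0 is only accepted when the coordinator's epoch is 0 or 1, which is consistent with the "generation > 1" condition in the report.
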
> Attached are logs of this case, here is a sample of a single log line
> demonstrating the problem:
> {code}
> 2025-05-02 15:23:09,304 [data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG kafka.request.logger - Completed request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The consumer group member has a smaller member epoch (0) than the one known by the group coordinator (11). The member must abandon all its partitions and rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
> {code}
> The logs show the broker continuously responding with errorCode 110 for 50s
> until, I'm assuming, some condition boots the member from the group, such
> that the next time the broker receives the request, the member is considered
> new and the request is successful.
> The first heartbeat is duplicated; I noticed that Kafka replies with
> FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm
> trying to see if it's possible to work around that.
> As an aside, between the fenced error happening _a lot_, this issue, and
> KAFKA-19222, I'm leaning toward not opting into KIP-848 by default until the
> broker implementation improves.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)