[ https://issues.apache.org/jira/browse/KAFKA-13979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fathima Khazana Abdul Haiyum resolved KAFKA-13979. -------------------------------------------------- Resolution: Not A Bug > Kafka resets committed offset after rebalance > --------------------------------------------- > > Key: KAFKA-13979 > URL: https://issues.apache.org/jira/browse/KAFKA-13979 > Project: Kafka > Issue Type: Bug > Affects Versions: 2.6.2 > Reporter: Fathima Khazana Abdul Haiyum > Priority: Critical > > We have 3 nodes in our MSK cluster which run Apache Kafka 2.6.2. We have 15 > partitions for a topic and 5 consumers in our consumer group, where each > consumer runs on it's own java application server. Whenever we > deploy(rolling) to our servers, we notice a huge consumer lag on *some* of > the 15 partitions. It appears that the consumer after rebalancing resets its > committed offset and reprocesses messages. For example: this is what I'm > seeing: > {code:java} > logger_name:org.apache.kafka.clients.consumer.internals.ConsumerCoordinator > message:[Consumer clientId=myService-mytopic-0, groupId=myService-mytopic] > Committed offset 3044 for partition mytopic-0{code} > > So we know for a fact that the offset 3044 has been committed for partition 0. > > Running {{./kafka-consumer-groups.sh --describe}} gives the following: > {code:java} > GROUP PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID CLIENT-ID > myService-mytopic 0 3044 3044 0 myService-mytopic-0 > {code} > {{ }} > After a deploy, which removes the consumer from the group and triggers a > rebalance + adds the consumer back, I see this: > {code:java} > GROUP PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID CLIENT-ID > myService-mytopic 0 1890 3047 1157 myService-mytopic-0{code} > > In the application logs, I see this: > {code:java} > logger_name:org.apache.kafka.clients.consumer.internals.Fetcher > message:[Consumer clientId=myService-mytopic-0, groupId=myService-mytopic] > Fetch position FetchPosition{offset=1890, offsetEpoch=Optional.empty, > currentLeader=LeaderAndEpoch{leader=Optional[b-3.kafka-mytestserver.1gkwlu.c16.kafka.us-east-1.amazonaws.com:9098 > (id: 3 rack: use1-az1)], epoch=0 is out of range for partition mytopic-0, > resetting offset}}{code} > Why is kafka fetching the current-offset 1890 which is before the committed > offset for the partition after rebalance? This is on a test environment where > less than 1 message is produced per second. This issue occurs for both auto > commit (default interval) and manual commit mechanisms and on kafka versions > 2.6.2 and 2.8.1. On production, we have much more traffic and causes > reprocessing of around 2 million messages per partition. > {{auto.offset.reset=latest}} and {{retention.ms=1000}} if that matters. We're > using the java client {{kafka-clients}} version 3.0.0. > The five consumers have the same {{{}client.id{}}}. > -- This message was sent by Atlassian Jira (v8.20.7#820007)