[ https://issues.apache.org/jira/browse/KAFKA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967211#comment-16967211 ]
ASF GitHub Bot commented on KAFKA-9140:
---------------------------------------

guozhangwang commented on pull request #7647: KAFKA-9140: Also reset join future when generation was reset in order to re-join
URL: https://github.com/apache/kafka/pull/7647

   Otherwise the join-group request would not be resent and we'd just fall into an endless loop.

   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)

> Consumer gets stuck rejoining the group indefinitely
> ----------------------------------------------------
>
>                 Key: KAFKA-9140
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9140
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 2.4.0
>            Reporter: Sophie Blee-Goldman
>            Priority: Blocker
>         Attachments: debug.tgz, info.tgz, kafka-data-logs-1.tgz,
> kafka-data-logs-2.tgz, server-start-stdout-stderr.log.tgz, streams.log.tgz
>
>
> There seems to be a race condition that can now cause a rejoining member to
> get stuck initiating a rejoin indefinitely. The relevant client logs are
> attached (streams.log.tgz; all other attachments are broker logs), but
> basically the client repeats this message (and nothing else) continuously
> until it is killed or shut down:
>
> {code:java}
> [2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> {code}
>
> This log message was added as part of the bugfix ([PR
> 7460|https://github.com/apache/kafka/pull/7460]) for a related race
> condition: KAFKA-8104.
>
> This issue was uncovered by the Streams version probing upgrade test, which
> fails with varying frequency. Here are the failure rates for the system test
> runs so far:
>
> trunk (cooperative): 1/1 and 2/10 failures
> 2.4 (cooperative): 0/10 and 1/15 failures
> trunk (eager): 0/10 failures
>
> I've kicked off some high-repeat runs to complete overnight and hopefully
> shed more light.
>
> Note that I have also kicked off runs of both 2.4 and trunk with the PR for
> KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug
> that was fixed by [PR 7460|https://github.com/apache/kafka/pull/7460]. It is
> therefore unclear whether [PR 7460|https://github.com/apache/kafka/pull/7460]
> introduced a new race condition/bug, or merely uncovered an existing one
> that previously would have failed first due to KAFKA-8104.