Sophie Blee-Goldman created KAFKA-9140:
------------------------------------------
Summary: Consumer gets stuck rejoining the group indefinitely
Key: KAFKA-9140
URL: https://issues.apache.org/jira/browse/KAFKA-9140
Project: Kafka
Issue Type: Bug
Components: clients, consumer
Affects Versions: 2.4.0
Reporter: Sophie Blee-Goldman
There seems to be a race condition that can now cause a rejoining member to
get stuck indefinitely initiating a rejoin. The relevant logs are attached;
in short, the consumer repeats the following message (and nothing else)
continuously until it is killed or shut down:
{code:java}
[2019-11-05 01:53:54,699] INFO [Consumer
clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer,
groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread.
Initiating rejoin.
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{code}
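As a rough way to spot this symptom in an attached log, the sketch below
counts how many times in a row the message above appears. It is a hypothetical
helper, not part of Kafka: the class name, method names, and threshold are
made up for illustration, and it assumes each log entry sits on a single line
(the entry quoted above is wrapped only by this email).

```java
import java.util.List;

/** Hypothetical helper (not part of Kafka) that flags the repeated-rejoin symptom in a log. */
final class RejoinLoopDetector {
    // Substring of the log message quoted in this report.
    static final String MARKER = "Generation data was cleared by heartbeat thread";

    /** Returns the length of the longest run of consecutive lines containing MARKER. */
    static int longestRejoinRun(List<String> logLines) {
        int longest = 0;
        int current = 0;
        for (String line : logLines) {
            if (line.contains(MARKER)) {
                current++;
                longest = Math.max(longest, current);
            } else {
                // Any other log line breaks the run.
                current = 0;
            }
        }
        return longest;
    }

    /** True if the message repeats at least threshold times in a row (threshold is an arbitrary choice). */
    static boolean looksStuck(List<String> logLines, int threshold) {
        return longestRejoinRun(logLines) >= threshold;
    }
}
```

A long uninterrupted run only suggests the stuck state described here; a short
run can also occur during a normal rebalance, so the threshold needs tuning.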
This log message was added as part of the bugfix (PR #7460) for a related
race condition, KAFKA-8104.
This issue was uncovered by the Streams version probing upgrade test, which
fails intermittently. The failure rates for the system test runs so far are:
trunk (cooperative): 1/1 and 2/10 failures
2.4 (cooperative): 0/10 and 1/15 failures
trunk (eager): 0/10 failures
I've kicked off some high-repeat runs to complete overnight, which will
hopefully shed more light.
Note that I have also run both 2.4 and trunk with the PR for KAFKA-8104
reverted. Both saw 2/10 failures, caused by the bug that PR #7460 fixed. It is
therefore unclear whether PR #7460 introduced a new race condition, or merely
uncovered an existing one that was previously masked because the test would
have failed first due to KAFKA-8104.