[ 
https://issues.apache.org/jira/browse/KAFKA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sophie Blee-Goldman updated KAFKA-9140:
---------------------------------------
    Priority: Blocker  (was: Critical)

> Consumer gets stuck rejoining the group indefinitely
> ----------------------------------------------------
>
>                 Key: KAFKA-9140
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9140
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 2.4.0
>            Reporter: Sophie Blee-Goldman
>            Priority: Blocker
>
> There seems to be a race condition that can now cause a rejoining member to 
> get stuck initiating a rejoin indefinitely. The relevant logs are 
> attached, but basically it repeats this message (and nothing else) 
> continuously until killed/shutdown:
>  
> {code:java}
> [2019-11-05 01:53:54,699] INFO [Consumer 
> clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer,
>  groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. 
> Initiating rejoin. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> {code}
>  
> The message that appears was added as part of the bugfix ([PR 
> 7460|https://github.com/apache/kafka/pull/7460]) for a related race 
> condition, KAFKA-8104.
> This issue was uncovered by the Streams version probing upgrade test, which 
> fails at a varying frequency. Here are the failure rates for the different 
> system test runs so far:
> trunk (cooperative): 1/1 and 2/10 failures
> 2.4 (cooperative): 0/10 and 1/15 failures
> trunk (eager): 0/10 failures
> I've kicked off some high-repeat runs to complete overnight and hopefully 
> shed more light.
> Note that I have also kicked off runs of both 2.4 and trunk with the PR for 
> KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug 
> that was fixed by [PR 7460|https://github.com/apache/kafka/pull/7460]. It is 
> therefore unclear whether [PR 7460|https://github.com/apache/kafka/pull/7460] 
> introduced a new race condition/bug, or merely uncovered an existing one 
> that previously would have failed first due to KAFKA-8104.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
