[ 
https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143168#comment-17143168
 ] 

Gokul Ramanan Subramanian commented on KAFKA-10105:
---------------------------------------------------

[~ableegoldman], we are unable to reproduce the bug systematically; it shows up 
only once in a while. We had a few Streams applications using the official Java 
2.4.1 client on a 2.4.1 cluster. Each Streams application uses a few threads, 
and all the threads across all these applications try to form a single consumer 
group. I am not sure what exactly causes the issue, but more than a day after 
shutting down the applications, the group coordinator still has some zombie 
members in its metadata. It takes a restart to fix this. If the delayed 
heartbeat operation were functioning as expected, I cannot understand how these 
zombie members could still be stuck in the coordinator. Wouldn't the coordinator 
remove them from the group? Further, even after starting the Streams 
applications again (after the one-day hiatus), the zombie members remain in the 
group metadata, and the group stays in the PendingRebalance state. 
Wouldn't GroupCoordinator.onJoinComplete (which should eventually be 
triggered after the DelayedJoin expiration) ensure that zombie members, which 
presumably did not send any JoinGroup requests in the meantime, are kicked out 
of the member list?

Theoretically speaking, what would it take for Kafka to reach this state, which 
is as good as a consumer livelock?
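
To make the question concrete, here is a toy model of the behaviour I would 
expect. It is only a sketch and not the broker's actual code: the names 
ToyGroup, Member, and schedule_heartbeat are hypothetical, and the 10-second 
session timeout is an illustrative value. It assumes that a member can only be 
evicted when a heartbeat deadline was scheduled for it, so a member added 
without one is never removed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Member:
    member_id: str
    # None models the suspected failure: no delayed heartbeat was ever
    # scheduled for this member, so nothing can expire it.
    heartbeat_deadline_ms: Optional[int]


@dataclass
class ToyGroup:
    """Hypothetical stand-in for the coordinator's group metadata."""
    session_timeout_ms: int
    members: Dict[str, Member] = field(default_factory=dict)

    def join(self, member_id: str, now_ms: int,
             schedule_heartbeat: bool = True) -> None:
        # Normally a join arms a deadline one session timeout in the future.
        deadline = now_ms + self.session_timeout_ms if schedule_heartbeat else None
        self.members[member_id] = Member(member_id, deadline)

    def heartbeat(self, member_id: str, now_ms: int) -> None:
        # A heartbeat pushes the deadline forward by one session timeout.
        self.members[member_id].heartbeat_deadline_ms = now_ms + self.session_timeout_ms

    def expire(self, now_ms: int) -> List[str]:
        # Evict every member whose deadline has passed; members without a
        # deadline are skipped, which is how a zombie survives indefinitely.
        expired = [m.member_id for m in self.members.values()
                   if m.heartbeat_deadline_ms is not None
                   and m.heartbeat_deadline_ms <= now_ms]
        for member_id in expired:
            del self.members[member_id]
        return expired


group = ToyGroup(session_timeout_ms=10_000)
group.join("stopped-client", now_ms=0)
group.join("zombie", now_ms=0, schedule_heartbeat=False)
evicted = group.expire(now_ms=24 * 60 * 60 * 1000)  # one day later
print(evicted, list(group.members))
```

In this model, one day later "stopped-client" has been evicted while "zombie" 
is still in the member list, which matches what we observe. Whether the real 
coordinator can actually end up with a member in that condition is exactly 
what I am asking.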

> Regression in group coordinator dealing with flaky clients joining while 
> leaving
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10105
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>         Environment: Kafka 2.4.1 on jre 11 on debian 9 in docker
>            Reporter: William Reynolds
>            Priority: Major
>
> Since the upgrade of a cluster from 1.1.0 to 2.4.1, the broker no longer 
> deals correctly with a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then follows up by 
> trying to send a join again as it is shutting down, the group coordinator adds 
> the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is 
> added as a new member, but the zombie member is still there and is included in 
> the partition assignment, which means that those partitions never get consumed 
> from. What can also happen is that one of the zombies becomes group leader, so 
> the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this was introduced between 1.1.0 
> and 2.4.1, but I will look further into this. Unfortunately the logs are 
> essentially silent about the zombie members, and I only had INFO level logging 
> on during the issue. By stopping all the consumers in the group and 
> restarting the broker coordinating that group, we could get back to a working 
> state.



