[ 
https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142608#comment-17142608
 ] 

William Reynolds commented on KAFKA-10105:
------------------------------------------

[~gokul2411s] the reference to 1.1.0 was because that was the version we were 
on before the upgrade to 2.4.1. It is a bit confusing now that I look at it, apologies.

 

[~ableegoldman] the consumers were the ruby-kafka clients [~theturtle32] 
described earlier, which aren't very tightly coded to keep up with Kafka version 
changes ([https://github.com/zendesk/ruby-kafka]); it just uses the new 
consumer. I think KAFKA-9935 was with one of the official clients, so perhaps 
digging into that would make for an easier reproduction. If that doesn't pan 
out, I believe Brian and I could also outline the ruby-kafka steps to 
reproduce.

> Regression in group coordinator dealing with flaky clients joining while 
> leaving
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10105
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>         Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
> Kafka 2.4.1 on jre 11 on debian 9 in docker
>            Reporter: William Reynolds
>            Priority: Major
>
> Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer deals 
> correctly with a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then follows up by 
> trying to send a join again as it is shutting down, the group coordinator adds 
> the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is 
> added as a new member, but the zombie member is still there and is included in 
> the partition assignment, which means that those partitions never get consumed 
> from. What can also happen is that one of the zombies becomes group leader, so 
> the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this got introduced between 1.1.0 
> and 2.4.1, but I will look further into this. Unfortunately the logs are 
> essentially silent about the zombie members, and I only had INFO level logging 
> on during the issue. By stopping all the consumers in the group and 
> restarting the broker coordinating that group we could get back to a working 
> state.
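
The sequence described in the issue can be illustrated with a small toy model. This is only a sketch of the reported symptom, not Kafka's group coordinator code: the class `ToyCoordinator`, its methods, and the `tracked` flag (standing in for "the coordinator never seems to heartbeat that member") are all hypothetical names made up for illustration, and the round-robin assignment is an assumption chosen for simplicity.

```python
import itertools


class ToyCoordinator:
    """Toy group coordinator; NOT Kafka's implementation (hypothetical model)."""

    def __init__(self, partitions, session_timeout=10):
        self.partitions = list(partitions)
        self.session_timeout = session_timeout
        self.members = {}  # member_id -> {"last_heartbeat": t, "tracked": bool}
        self._ids = itertools.count(1)

    def join(self, now, tracked=True):
        # tracked=False models the reported symptom: a member that is added
        # to the group but whose session is never checked for expiry.
        member_id = f"member-{next(self._ids)}"
        self.members[member_id] = {"last_heartbeat": now, "tracked": tracked}
        return member_id

    def leave(self, member_id):
        self.members.pop(member_id, None)

    def heartbeat(self, member_id, now):
        self.members[member_id]["last_heartbeat"] = now

    def expire(self, now):
        # Only members with a tracked session can time out.
        for member_id, state in list(self.members.items()):
            timed_out = now - state["last_heartbeat"] > self.session_timeout
            if state["tracked"] and timed_out:
                del self.members[member_id]

    def assign(self):
        # Simple round-robin over all members, including any zombies.
        ids = sorted(self.members)
        assignment = {m: [] for m in ids}
        for i, partition in enumerate(self.partitions):
            assignment[ids[i % len(ids)]].append(partition)
        return assignment


def simulate():
    coord = ToyCoordinator(partitions=[0, 1, 2, 3])
    first = coord.join(now=0)
    coord.leave(first)                         # consumer sends a leave...
    zombie = coord.join(now=1, tracked=False)  # ...then a join while shutting down
    live = coord.join(now=5)                   # restarted consumer joins as a new member
    coord.heartbeat(live, now=100)
    coord.expire(now=100)                      # the zombie is never expired
    return zombie, live, coord.assign()
```

Calling `simulate()` in this model leaves the zombie holding partitions 0 and 2 long after its session timeout, while the live consumer only receives partitions 1 and 3, which matches the "partitions never get consumed from" symptom described above.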



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
