[ https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142608#comment-17142608 ]
William Reynolds commented on KAFKA-10105:
------------------------------------------

[~gokul2411s] the reference to 1.1.0 was because that was the version we were on before the upgrade to 2.4.1. It is a bit confusing now that I look at it, apologies.

[~ableegoldman] the consumers were the ruby-kafka clients ([https://github.com/zendesk/ruby-kafka]) that [~theturtle32] described earlier, which aren't very tightly coded to keep up with Kafka version changes; the library just uses the new consumer. I think KAFKA-9935 was with one of the official clients, so perhaps digging into that would make for an easier reproduction. If that doesn't pan out, I believe Brian and I could also outline the ruby-kafka steps to reproduce.

> Regression in group coordinator dealing with flaky clients joining while
> leaving
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10105
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>         Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
>                      Kafka 2.4.1 on jre 11 on debian 9 in docker
>            Reporter: William Reynolds
>            Priority: Major
>
> Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer deals
> correctly with a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then follows up by
> trying to send a join again as it is shutting down, the group coordinator adds
> the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is
> added as a new member, but the zombie member is still there and is included
> in the partition assignment, which means that those partitions never get
> consumed from. What can also happen is that one of the zombies becomes group
> leader, so the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this got introduced between 1.1.0
> and 2.4.1, but I will look further into this. Unfortunately, the logs are
> essentially silent about the zombie members, and I only had INFO level
> logging on during the issue. By stopping all the consumers in the group and
> restarting the broker coordinating that group, we could get back to a working
> state.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)