[ https://issues.apache.org/jira/browse/KAFKA-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantine Karantasis resolved KAFKA-9849.
-------------------------------------------
    Resolution: Fixed

> Fix issue with worker.unsync.backoff.ms creating zombie workers when
> incremental cooperative rebalancing is used
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9849
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9849
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.3.1, 2.5.0, 2.4.1
>            Reporter: Konstantine Karantasis
>            Assignee: Konstantine Karantasis
>            Priority: Major
>             Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1
>
>
> {{worker.unsync.backoff.ms}} is a property that was introduced a while ago
> when eager (stop-the-world) rebalancing was the only option for Connect
> workers. The goal of this property is to avoid triggering consecutive
> rebalances when a worker fails to catch up with the config topic in time and
> therefore voluntarily leaves the group with a {{LeaveGroupRequest}}.
> With incremental cooperative rebalancing, this backoff
> ({{worker.unsync.backoff.ms}}), whose default value equals the default
> value of {{scheduled.rebalance.max.delay.ms}} (5 min), might end up turning a
> worker into a zombie worker that retains its tasks but stays out of the
> group. If the worker that backs off does not return before
> {{scheduled.rebalance.max.delay.ms}} expires, the leader of the group has no
> option but to reassign the missing tasks, which it considers lost, to other
> members of the group.
> Clearly, {{worker.unsync.backoff.ms}} was introduced to avoid rebalancing
> storms in the presence of intermittent connectivity issues with eager
> rebalancing. However, when incremental cooperative rebalancing is used, this
> property might inadvertently make workers operate as zombie workers that keep
> running tasks while they are out of the group.
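
The timing argument above can be made concrete with a small Python sketch. This is a simplified model, not Kafka Connect code: the function name, the `rejoin_overhead_ms` parameter, and the assumption that the leader starts its delay timer at the moment the worker leaves are all hypothetical. Only the two property names and the 5 min default come from the description.

```python
# Simplified, hypothetical model of the race described in KAFKA-9849.
# It is NOT taken from the Kafka Connect code base.

# Both defaults are 5 minutes according to the issue description.
WORKER_UNSYNC_BACKOFF_MS = 300_000          # worker.unsync.backoff.ms
SCHEDULED_REBALANCE_MAX_DELAY_MS = 300_000  # scheduled.rebalance.max.delay.ms


def becomes_zombie(backoff_ms, max_delay_ms, rejoin_overhead_ms=0):
    """Return True if the leader's delay expires before the worker is back.

    Assumption: the worker leaves the group at t=0 and the leader starts
    counting max_delay_ms at the same moment. The worker keeps running its
    tasks while it backs off and is back in the group at
    t = backoff_ms + rejoin_overhead_ms. If that is not strictly earlier
    than max_delay_ms, the leader has already reassigned the tasks.
    """
    return backoff_ms + rejoin_overhead_ms >= max_delay_ms


if __name__ == "__main__":
    # With both defaults at 5 minutes the worker cannot return in time.
    print(becomes_zombie(WORKER_UNSYNC_BACKOFF_MS,
                         SCHEDULED_REBALANCE_MAX_DELAY_MS))
    # A backoff well below the delay leaves room to rejoin.
    print(becomes_zombie(30_000, SCHEDULED_REBALANCE_MAX_DELAY_MS,
                         rejoin_overhead_ms=5_000))
```

Under this model, any backoff that is not comfortably shorter than the scheduled rebalance delay leaves the worker running tasks that the leader has already handed to someone else.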
> Of course, a good tradeoff needs to be made between keeping the protocol
> from becoming too eager again and keeping workers from turning into zombies
> when the connection to the broker coordinator is lost only briefly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)