[ 
https://issues.apache.org/jira/browse/KAFKA-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantine Karantasis resolved KAFKA-9849.
-------------------------------------------
    Resolution: Fixed

> Fix issue with worker.unsync.backoff.ms creating zombie workers when 
> incremental cooperative rebalancing is used
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9849
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9849
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.3.1, 2.5.0, 2.4.1
>            Reporter: Konstantine Karantasis
>            Assignee: Konstantine Karantasis
>            Priority: Major
>             Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1
>
>
> {{worker.unsync.backoff.ms}} is a property that was introduced a while ago 
> when eager (stop-the-world) rebalancing was the only option for Connect 
> workers. The goal of this property is to avoid triggering consecutive 
> rebalances when a worker fails to catch up with the config topic in time and 
> therefore voluntarily leaves the group with a {{LeaveGroupRequest}}.
> With incremental cooperative rebalancing, this backoff 
> ({{worker.unsync.backoff.ms}}), which has a default value equal to the default 
> value of {{scheduled.rebalance.max.delay.ms}} (5 min), might end up turning a 
> worker into a zombie worker that retains its tasks but stays out of the 
> group. By backing off from rebalancing, this worker leaves the leader of the 
> group no option but to reassign the missing tasks, which are presumed lost, 
> to other members of the group if the worker does not return before 
> {{scheduled.rebalance.max.delay.ms}} expires. 
> Clearly, {{worker.unsync.backoff.ms}} was introduced to avoid rebalancing 
> storms in the presence of intermittent connectivity issues with eager 
> rebalancing. However, when incremental cooperative rebalancing is used, this 
> property might inadvertently make workers operate as zombie workers that keep 
> running tasks while they are out of the group.
> Of course, a good tradeoff is needed: the protocol should not become too 
> eager again, but workers should also not turn into zombies when the 
> connection to the broker coordinator is lost only briefly.
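
The timing interaction described above can be illustrated with a small sketch. This is a simplified model and not Connect's actual logic: the class `WorkerTimings`, the function `becomes_zombie`, and the `rejoin_overhead_ms` parameter are hypothetical names introduced only for illustration. It assumes the worker's backoff and the leader's delay both start at the moment the worker leaves the group. The two default values (5 minutes each) are the ones stated in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerTimings:
    """Hypothetical container for the two settings discussed in the issue.

    Both defaults are 5 minutes, as stated in the issue description.
    """
    worker_unsync_backoff_ms: int = 300_000
    scheduled_rebalance_max_delay_ms: int = 300_000


def becomes_zombie(timings: WorkerTimings, rejoin_overhead_ms: int = 0) -> bool:
    """Return True if, in this simplified model, the leader's delay expires
    before the backing-off worker is back in the group.

    Assumption: the backoff and the leader's delay start at the same instant.
    rejoin_overhead_ms is a hypothetical extra cost of rejoining the group.
    """
    away_ms = timings.worker_unsync_backoff_ms + rejoin_overhead_ms
    # The worker must return strictly before the delay expires to keep its tasks.
    return away_ms >= timings.scheduled_rebalance_max_delay_ms


if __name__ == "__main__":
    # With the defaults, the backoff equals the delay, so the worker is late.
    print(becomes_zombie(WorkerTimings()))
    # A shorter (illustrative) backoff lets the worker return in time.
    print(becomes_zombie(WorkerTimings(worker_unsync_backoff_ms=60_000)))
```

Under these assumptions, equal defaults leave the worker no margin to rejoin, which matches the zombie scenario in the description. The 60-second value is only an illustrative input, not a recommended setting.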



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
