Greg Harris created KAFKA-10296:
-----------------------------------

             Summary: Connector task reported RUNNING after hard bounce of 
worker
                 Key: KAFKA-10296
                 URL: https://issues.apache.org/jira/browse/KAFKA-10296
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
    Affects Versions: 2.4.1, 2.5.0, 2.3.1
            Reporter: Greg Harris


While fixing flakiness for ConnectDistributedTest.test_bounce in KAFKA-10295, I 
observed that the status of connectors on zombie/offline workers was 
inconsistent during Incremental Cooperative Rebalancing's 
scheduled.delay.max.interval.ms. This is the reproduction case observed there:
 # A task is running on worker A, with another worker B in the same distributed 
cluster
 # Observe that on worker B's REST API, the task is initially correctly 
"RUNNING"
 # Worker A is hard-stopped, and goes offline
 # Observe that on worker B's REST API, the task is still "RUNNING"
 # The group rebalances without worker A in the group, and begins the delay
 # Observe that on worker B's REST API, the task is still "RUNNING"
 # Worker A recovers and joins the group, before the delay expires
 # Observe that on worker B's REST API, the task is still "RUNNING"
 # The rebalance delay expires, and the task is assigned and started
 # Observe that on worker B's REST API, the task is now correctly "RUNNING"

 * In the first state (4), after the worker goes offline, but before the other 
workers learn that they have gone offline, it is acceptable that the task is 
still reported as running. We can't expect that the other workers know that 
worker A has gone offline until the group membership protocol informs them.
 * In the second state (6), when a rebalance occurs and the worker is first 
known to be unhealthy, the state of the task is ambiguous, since it may be down 
completely, or running on a zombie worker. I'm not sure how best to capture 
this state under the existing enum's options, but it's probably closest to 
"UNASSIGNED" since the leader doesn't think that any worker is currently 
running that task.
 * In the third state (8), when the bounced worker returns, the task is 
reported RUNNING on a worker which does not have the task assigned. This is the 
most inaccurate state reported, since the cluster has reached consensus, and 
yet the REST API still reports the wrong state of the task.

State (6) could be assigned a new state "UNKNOWN", introduced by a KIP. 
However, this is a large investment in process and time for what amounts to an 
almost cosmetic change, and we could either leave this state as "RUNNING", or 
change it to "UNASSIGNED"
State (8) could be described by "UNASSIGNED", and would be a tangible 
improvement for test_bounce, which currently needs to intentionally ignore the 
REST API's result here because it is inaccurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to