Chris Egerton created KAFKA-17130:
-------------------------------------

             Summary: Connect workers do not properly ensure group membership 
before responding to health checks
                 Key: KAFKA-17130
                 URL: https://issues.apache.org/jira/browse/KAFKA-17130
             Project: Kafka
          Issue Type: Bug
          Components: connect
    Affects Versions: 3.8.0, 3.9.0
            Reporter: Chris Egerton


Initially reported [here|https://github.com/apache/kafka/pull/16585].

When a distributed Connect worker's herder begins an iteration of its tick 
loop, it tries to ensure that the worker is still in contact with the Kafka 
cluster that's used for cluster coordination and internal topics; see 
[here|https://github.com/apache/kafka/blob/0ada8fac6869cad8ac33a79032cf5d57bfa2a3ea/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L419].

However, this method may return even if the Kafka cluster is down. It does not 
force a heartbeat request to be sent to the broker, and may return if the time 
since the last heartbeat is small enough.

We may want to require that at least one request (possibly, specifically a 
heartbeat) be sent to the group coordinator before returning from 
{{WorkerGroupMember::ensureActive}}. This would guarantee that the health check 
endpoint only returns 200 if it has explicitly validated the health of the 
worker's connection to the group coordinator after the request to the endpoint 
was initiated.
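
To illustrate the difference, here is a minimal, self-contained sketch. It is not the actual Connect code: {{HealthCheckSketch}}, {{FakeGroupMember}}, {{ensureActiveLenient}}, {{ensureActiveStrict}} and the 3000 ms interval are all hypothetical names and values chosen for the example. It only models the idea that a check which skips the request when the last heartbeat is recent can pass while the coordinator is unreachable, whereas a check that always sends a request cannot.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical, simplified model; these are not the real Kafka Connect classes. */
public class HealthCheckSketch {

    /** Stand-in for a group member that remembers its last successful heartbeat. */
    static class FakeGroupMember {
        private final BooleanSupplier coordinatorReachable; // assumption: models broker availability
        private final long heartbeatIntervalMs;
        private long lastHeartbeatMs;
        int heartbeatsSent = 0;

        FakeGroupMember(BooleanSupplier coordinatorReachable, long heartbeatIntervalMs, long nowMs) {
            this.coordinatorReachable = coordinatorReachable;
            this.heartbeatIntervalMs = heartbeatIntervalMs;
            this.lastHeartbeatMs = nowMs; // pretend a heartbeat just succeeded
        }

        /** Lenient check: sends nothing if the last heartbeat is recent enough. */
        boolean ensureActiveLenient(long nowMs) {
            if (nowMs - lastHeartbeatMs < heartbeatIntervalMs) {
                return true; // coordinator health was NOT verified on this call
            }
            return sendHeartbeat(nowMs);
        }

        /** Strict check: always sends at least one request before returning. */
        boolean ensureActiveStrict(long nowMs) {
            return sendHeartbeat(nowMs);
        }

        private boolean sendHeartbeat(long nowMs) {
            heartbeatsSent++;
            if (!coordinatorReachable.getAsBoolean()) {
                return false; // request failed: coordinator is unreachable
            }
            lastHeartbeatMs = nowMs;
            return true;
        }
    }

    public static void main(String[] args) {
        boolean[] clusterUp = {true};
        FakeGroupMember member = new FakeGroupMember(() -> clusterUp[0], 3000L, 0L);

        clusterUp[0] = false; // the Kafka cluster goes down right after a heartbeat

        // One second later, the lenient check still reports the worker as healthy.
        System.out.println("lenient: " + member.ensureActiveLenient(1000L));
        // The strict check sends a request and notices the failure.
        System.out.println("strict: " + member.ensureActiveStrict(1000L));
    }
}
```

In this model, a health check built on the lenient variant would answer 200 for up to one heartbeat interval after the cluster became unreachable, which is the behavior described above.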



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
