[
https://issues.apache.org/jira/browse/KAFKA-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870747#comment-17870747
]
Chris Egerton commented on KAFKA-17252:
---------------------------------------
Note that this same issue (failure to forward a request to the leader) may
occur for generating task configs as well, but it doesn't have the same impact
because that operation is retried.
> Forwarded source task zombie fencings may fail when leader has just started
> ---------------------------------------------------------------------------
>
> Key: KAFKA-17252
> URL: https://issues.apache.org/jira/browse/KAFKA-17252
> Project: Kafka
> Issue Type: Bug
> Components: connect
> Reporter: Chris Egerton
> Assignee: Chris Egerton
> Priority: Major
>
> We've observed some flaky integration test failures such as [this
> one|https://ge.apache.org/s/52il7msnknzp2/tests/task/:connect:mirror:test/details/org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest/testMultiNodeCluster()?top-execution=1]
> where a source task fails to start with exactly-once support enabled with
> this stack trace:
> {code:java}
> org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: This
> worker is still starting up and has not been able to read a session key from
> the config topic yet
> at
> org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:186)
> at
> org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:140)
> at
> org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:101)
> at
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$fenceZombieSourceTasks$23(DistributedHerder.java:1329)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
> at java.base/java.lang.Thread.run(Thread.java:1583) {code}
> This occurs because the leader has not yet read (or believes it has not yet
> read) a session key from the config topic.
> However, in a cluster where all nodes have always used the sessioned
> rebalance protocol, this scenario should be impossible: there must be a
> session key present in the topic in order for a leader to handle external
> requests (such as creating connectors), and all workers must read to the end
> of all internal topics before joining the cluster.
> The cause of this failure is that, during startup, session keys read from the
> config topic are ignored. The herder does [check its config state snapshot
> for a session
> key|https://github.com/apache/kafka/blob/da14b5a61dc90fc70748278c98ce312a7a433c0d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L463]
> in its tick thread loop, which should help, but it's possible that a worker
> joins the cluster, becomes the leader, and receives a request from a follower
> to fence a zombie source task before this check occurs, which will then cause
> the leader to response with a 503 error, failing the task.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)