[ 
https://issues.apache.org/jira/browse/KAFKA-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870747#comment-17870747
 ] 

Chris Egerton commented on KAFKA-17252:
---------------------------------------

Note that this same issue (failure to forward a request to the leader) may 
occur for generating task configs as well, but it doesn't have the same impact 
because that operation is retried.

> Forwarded source task zombie fencings may fail when leader has just started
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-17252
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17252
>             Project: Kafka
>          Issue Type: Bug
>          Components: connect
>            Reporter: Chris Egerton
>            Assignee: Chris Egerton
>            Priority: Major
>
> We've observed some flaky integration test failures such as [this 
> one|https://ge.apache.org/s/52il7msnknzp2/tests/task/:connect:mirror:test/details/org.apache.kafka.connect.mirror.integration.DedicatedMirrorIntegrationTest/testMultiNodeCluster()?top-execution=1]
>  where a source task fails to start with exactly-once support enabled with 
> this stack trace:
> {code:java}
> org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: This 
> worker is still starting up and has not been able to read a session key from 
> the config topic yet
>         at 
> org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:186)
>         at 
> org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:140)
>         at 
> org.apache.kafka.connect.runtime.rest.RestClient.httpRequest(RestClient.java:101)
>         at 
> org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$fenceZombieSourceTasks$23(DistributedHerder.java:1329)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>         at java.base/java.lang.Thread.run(Thread.java:1583) {code}
> This occurs because the leader has not yet read (or believes it has not yet 
> read) a session key from the config topic.
> However, in a cluster where all nodes have always used the sessioned 
> rebalance protocol, this scenario should be impossible: there must be a 
> session key present in the topic in order for a leader to handle external 
> requests (such as creating connectors), and all workers must read to the end 
> of all internal topics before joining the cluster.
> The cause of this failure is that, during startup, session keys read from the 
> config topic are ignored. The herder does [check its config state snapshot 
> for a session 
> key|https://github.com/apache/kafka/blob/da14b5a61dc90fc70748278c98ce312a7a433c0d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L463]
>  in its tick thread loop, which should help, but it's possible that a worker 
> joins the cluster, becomes the leader, and receives a request from a follower 
> to fence a zombie source task before this check occurs, which will then cause 
> the leader to response with a 503 error, failing the task.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to