C0urante commented on PR #11783: URL: https://github.com/apache/kafka/pull/11783#issuecomment-1190816405
Ah, thanks for the update, Luke! In the meantime, I've discovered and addressed a few more issues that surfaced during my local runs yesterday:

Since follower workers don't retry when zombie fencing requests to the leader fail (which is intentional, as we want to be able to surface failures caused by things like insufficient ACLs to perform a round of fencing), a task hosted on a follower may fail during startup if the leader has just been bounced and its worker process hasn't started yet. I've added a small step that restarts any failed tasks after all the bounces have completed and before we check that the connector and its tasks are healthy.

Since the REST API is available before workers have actually completed startup, requests to fence zombies (and submit task configs) can also reach the leader before it has been able to read a session key from the config topic. I've tweaked the herder logic to catch this case and throw a 503 error with a user-friendly error message. I experimented with some other approaches, such as automatically refreshing the leader's view of the config topic in this case and/or handling request signature validation on the herder's tick thread (which would ensure that the worker had been able to complete startup and read to the current end of the config topic), but the additional complexity incurred by these options didn't seem worth the benefits since they would still be incomplete for cases like the one described above.

It's also possible that, when hard-bouncing a worker, a transaction opened by one of its tasks gets left hanging. If the task has begun to write offsets, then startup for subsequent workers will be blocked until that transaction expires, which by default takes 60 seconds. This can cause test failures because we usually wait 60 seconds for workers to complete startup. To address this, I've lowered the transaction timeout to 10 seconds.
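To make the timing argument concrete, here is a rough sketch (not code from the PR) of why a 60-second transaction timeout collides with a 60-second startup wait. The producer config key `transaction.timeout.ms` and its 60-second default are real; the class name, the helper methods, and the 5 seconds of "other startup work" are assumptions made up for illustration, and the sketch does not show where the test actually sets the override.

```java
import java.time.Duration;
import java.util.Properties;

public class TransactionTimeoutSketch {
    // Real producer config key; a hanging transaction is aborted once this
    // much time has passed without it being committed or aborted.
    static final String TRANSACTION_TIMEOUT_KEY = "transaction.timeout.ms";

    // Hypothetical helper: producer overrides carrying the chosen timeout.
    static Properties producerOverrides(Duration transactionTimeout) {
        Properties props = new Properties();
        props.setProperty(TRANSACTION_TIMEOUT_KEY,
                Long.toString(transactionTimeout.toMillis()));
        return props;
    }

    // Hypothetical worst-case check: a new worker is blocked for the full
    // transaction timeout and then still has its remaining startup work to do.
    static boolean startupFitsInWait(Properties props,
                                     Duration startupWait,
                                     Duration otherStartupWork) {
        long timeoutMs = Long.parseLong(props.getProperty(TRANSACTION_TIMEOUT_KEY));
        return timeoutMs + otherStartupWork.toMillis() < startupWait.toMillis();
    }

    public static void main(String[] args) {
        Duration startupWait = Duration.ofSeconds(60); // how long the test waits
        Duration otherWork = Duration.ofSeconds(5);    // assumed, illustrative only

        // Default timeout: the hanging transaction alone uses up the whole wait.
        System.out.println(startupFitsInWait(
                producerOverrides(Duration.ofSeconds(60)), startupWait, otherWork));

        // Lowered timeout: plenty of headroom remains.
        System.out.println(startupFitsInWait(
                producerOverrides(Duration.ofSeconds(10)), startupWait, otherWork));
    }
}
```

With these assumed numbers the first check prints `false` and the second prints `true`, which matches the reasoning above: lowering the timeout trades slower cleanup of genuinely stuck transactions for a startup window the test can rely on.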
Ideally, we could proactively abort any open transactions left behind by prior task generations during zombie fencing, but it's probably too late to add this kind of logic in time for the 3.3.0 release. I've filed https://issues.apache.org/jira/browse/KAFKA-14091 to track this.

There's also a possible NPE in `KafkaBasedLog` caused by yet another unsafe use of `Utils::closeQuietly`. It's not a major issue since it only occurs when the log is shut down before it has had a chance to start, but it's still worth patching.

I've kicked off another local run of `test_exactly_once_source` with unclean shutdown and the `sessioned` protocol after applying these changes. Only five runs have completed so far, but they've all succeeded. I'll report the results once the other ninety-five runs have completed.
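For readers wondering how a "quiet" close can still throw, here is a minimal sketch of one way it can happen. The comment doesn't spell out the exact unsafe call, so this assumes the common pattern of passing a bound method reference (`field::close`) on a field that is still null because the object was never started. The `closeQuietly` helper and the `Log` class below are hypothetical stand-ins, not Kafka's `Utils` or `KafkaBasedLog`.

```java
public class CloseQuietlySketch {
    // Hypothetical stand-in for a close-quietly helper: tolerates a null
    // argument and swallows anything thrown by close().
    static void closeQuietly(AutoCloseable closeable, String name) {
        if (closeable == null) {
            return;
        }
        try {
            closeable.close();
        } catch (Throwable t) {
            System.err.println("Failed to close " + name + ": " + t);
        }
    }

    // Hypothetical log-like class whose resource only exists after start().
    static class Log {
        private AutoCloseable producer; // stays null until start() is called

        void start() {
            producer = () -> { }; // placeholder resource with a no-op close
        }

        // Unsafe: `producer::close` is evaluated before closeQuietly runs,
        // and a bound method reference on a null target throws an NPE.
        void stopUnsafe() {
            closeQuietly(producer::close, "producer");
        }

        // Safe: pass the reference itself and let the helper handle null.
        void stopSafe() {
            closeQuietly(producer, "producer");
        }
    }

    public static void main(String[] args) {
        Log neverStarted = new Log();
        try {
            neverStarted.stopUnsafe();
        } catch (NullPointerException e) {
            System.out.println("NPE thrown before closeQuietly was even called");
        }
        neverStarted.stopSafe(); // completes without throwing
    }
}
```

The point of the sketch is that the helper's own null check can't help when the null dereference happens while building the argument, which is consistent with the failure only showing up when shutdown precedes startup.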