C0urante commented on PR #11783:
URL: https://github.com/apache/kafka/pull/11783#issuecomment-1190816405

   Ah, thanks for the update Luke! In the meantime, I've discovered and 
addressed a few more issues that surfaced during my local runs yesterday:
   
   Since follower workers don't retry when zombie fencing requests to the 
leader fail (which is intentional, as we want to be able to surface failures 
caused by things like insufficient ACLs to perform a round of fencing), it's 
possible that a task that's hosted on a follower may fail during startup if the 
leader has just been bounced and its worker process hasn't started yet. I've added 
a small step to restart any failed tasks after all the bounces have completed 
and before we check to make sure that the connector and its tasks are healthy.
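
   For illustration, the restart step could look roughly like the sketch below. It is not the code from this PR: the helper name `failed_task_restart_paths` and the sample payload are hypothetical. It only assumes the standard Connect REST API, where `GET /connectors/{name}/status` reports per-task state and `POST /connectors/{name}/tasks/{id}/restart` restarts a single task.

```python
def failed_task_restart_paths(connector, status):
    """Return the REST paths to POST to in order to restart every FAILED task.

    `status` is a dict shaped like the body of GET /connectors/{name}/status.
    """
    return [
        f"/connectors/{connector}/tasks/{task['id']}/restart"
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


# Hypothetical status payload: task 1 failed while the leader was down.
status = {
    "name": "my-source",
    "connector": {"state": "RUNNING", "worker_id": "worker1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "worker1:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "worker2:8083", "trace": "..."},
    ],
}

restart_paths = failed_task_restart_paths("my-source", status)
```

   A test would issue a POST to each returned path once all bounces have finished, then run the usual health check on the connector and its tasks.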
   
   Since the REST API is available before workers have actually completed 
startup, it's also possible that requests to fence zombies (and submit task 
configs) can be made to the leader before it has been able to read a session 
key from the config topic. I've tweaked the herder logic to catch this case and 
throw a 503 error with a user-friendly error message. I experimented with some 
other approaches to automatically refresh the leader's view of the config topic 
in this case, and/or handle request signature validation on the herder's tick 
thread (which would ensure that the worker had been able to complete startup 
and read to the current end of the config topic), but the additional complexity 
incurred by these options didn't seem worth the benefits since they would still 
be incomplete for cases like the one described above.
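
   As a rough model of the guard described above (the real herder is Java, and every name and the message text here are hypothetical), the idea is to reject the request with a 503 when no session key has been read yet, rather than attempting signature validation without one:

```python
class ServiceUnavailable(Exception):
    """Hypothetical stand-in for an HTTP 503 response raised by the REST layer."""

    status_code = 503


def require_session_key(session_key):
    """Return the session key, or signal a 503 if the worker has not read one yet."""
    if session_key is None:
        # Illustrative wording only; not the message used in the PR.
        raise ServiceUnavailable(
            "This worker is still starting up and has not yet read a session "
            "key from the config topic. Please try again later."
        )
    return session_key
```

   Because followers intentionally do not retry these requests, a caller that hits this case still sees a failure; the guard only makes the cause clear.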
   
   It's also possible that, when hard-bouncing a worker, a transaction opened 
by one of its tasks gets left hanging. If the task has begun to write offsets, 
then startup for subsequent workers will be blocked on the expiration of that 
transaction, which by default takes 60 seconds. This can cause test failures 
because we usually wait for 60 seconds for workers to complete startup. To 
address this, I've lowered the transaction timeout to 10 seconds. Ideally, we 
could proactively abort any open transactions left behind by prior task 
generations during zombie fencing, but it's probably too late to add this kind 
of logic in time for the 3.3.0 release. I've filed 
https://issues.apache.org/jira/browse/KAFKA-14091 to track this.
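
   To make the timing concrete, here is a small sketch of the arithmetic. `transaction.timeout.ms` is the standard producer property (default 60000). The `producer.` prefix on the worker config is an assumption about where the override is applied; the actual test change may set it elsewhere. The helper names are hypothetical.

```python
STARTUP_WAIT_MS = 60_000  # how long the tests usually wait for worker startup


def startup_headroom_ms(transaction_timeout_ms, startup_wait_ms=STARTUP_WAIT_MS):
    """Time left for startup after a hanging transaction has expired."""
    return startup_wait_ms - transaction_timeout_ms


def worker_overrides(transaction_timeout_ms):
    """Worker properties that lower the transaction timeout for task producers.

    Assumption: the override is applied through the worker-level "producer." prefix.
    """
    return {"producer.transaction.timeout.ms": str(transaction_timeout_ms)}


default_headroom = startup_headroom_ms(60_000)  # no slack with the default timeout
lowered_headroom = startup_headroom_ms(10_000)  # 50 seconds of slack at 10 seconds
overrides = worker_overrides(10_000)
```

   With the default timeout, a hanging transaction can consume the entire startup window, which is why the tests failed intermittently.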
   
   There's also a possible NPE in `KafkaBasedLog` caused by yet another unsafe 
use of `Utils::closeQuietly`. It's not a major issue since it only occurs when 
the log is shut down before it has had a chance to start, but it's still worth 
patching.
   
   I've kicked off another local run of `test_exactly_once_source` with unclean 
shutdown and the `sessioned` protocol after applying these changes. I've only 
completed five tests so far, but they've all succeeded. Will report the results 
after the other ninety-five runs have completed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
