gianm opened a new pull request, #16158:
URL: https://github.com/apache/druid/pull/16158

   Prior to this patch, canceled workers would keep trying to contact the
controller: they would attempt to report an error, and if they were in the
midst of some other call (like a counters push), they would keep retrying it.
   
   This can cause cancellation to be delayed, because the controller shuts down 
its HTTP server before it cancels workers. Workers are then stuck retrying 
calls to the controller that will never succeed. The retry loops are broken 
when the controller gives up on them (one minute later) and exits for real. 
Then, the controller failure detection logic on the worker detects that the 
controller has failed, and the worker finally shuts down.
   
   This patch speeds up worker cancellation by bypassing communication with the
controller, which is unnecessary in either cancellation path. If the controller
canceled the workers, it isn't interested in further communications from them.
If the workers were canceled out-of-band, the controller can detect this
through worker monitoring and report it as a WorkerFailed error.
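
   As a rough illustration of the idea (not the code in this patch), the
sketch below shows a retry loop that checks a cancellation flag before every
attempt, so a canceled worker stops calling the controller right away instead
of retrying until the controller exits. The class name
CancelAwareControllerClient and the method callWithRetries are hypothetical
and do not correspond to Druid classes; a real client would also back off
between attempts, which is omitted here.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: these names are illustrative, not Druid's actual API.
final class CancelAwareControllerClient {
    // Set once the worker has been canceled, whether by the controller or out-of-band.
    private final AtomicBoolean canceled = new AtomicBoolean(false);

    void cancel() {
        canceled.set(true);
    }

    boolean isCanceled() {
        return canceled.get();
    }

    /**
     * Tries a controller call up to maxAttempts times. The supplier returns
     * true when the call succeeded. Returns false without making further
     * attempts as soon as the worker is canceled.
     */
    boolean callWithRetries(BooleanSupplier attempt, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (canceled.get()) {
                // Canceled: skip the controller entirely rather than retrying
                // a call that may never succeed.
                return false;
            }
            if (attempt.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, a call made after cancel() returns immediately without
invoking the attempt at all, and a retry loop already in progress stops at its
next iteration.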


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

