kfaraz opened a new pull request, #16462:
URL: https://github.com/apache/druid/pull/16462

   ### Description
   
   The task status API was improved in #15724 to serve task statuses from 
Overlord memory.
   On older Overlord versions, however, this API returns an `unknown` task 
location, which causes failures in inter-task communication.
   
   This was later fixed in #16227, but that fix introduced another bug, 
described below. The bug is typically reproducible in MSQ controller tasks but 
may also occur in native batch ingestion.
   
   - The `SpecificTaskServiceLocator` first calls the multi-task status API 
`/druid/indexer/v1/taskStatus` to determine the location of a task.
   - This API returns an unknown task location on old Overlord versions.
   - The `SpecificTaskServiceLocator` then falls back to calling the single 
task status API `/druid/indexer/v1/task/{taskId}/status`.
   - The problem is that the second API is invoked synchronously in the 
callback of the first future, while holding a lock.
   - If there are enough MSQ workers, all the `ServiceClientFactory` threads 
can end up stuck waiting to get back a task location. Fetching the task 
location itself requires one of those `ServiceClientFactory` threads, so none 
of them can make progress.
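
The starvation described in the list above can be reproduced in isolation. The following is a minimal sketch, not Druid code: it uses a plain `java.util.concurrent` thread pool as a stand-in for the `ServiceClientFactory` threads, and the class name `BlockingFallbackDemo`, the method `lookupStarves`, and the returned host string are all hypothetical. A task running on the pool submits a second task to the same pool and blocks on its result; when every pool thread is occupied this way, the second task can never run.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingFallbackDemo
{
  /**
   * Runs one "location lookup" whose fallback call is awaited synchronously
   * on a pool thread. Returns true if the lookup did not finish within
   * timeoutMillis, i.e. the pool was starved.
   */
  static boolean lookupStarves(int poolSize, long timeoutMillis) throws Exception
  {
    // Stand-in for the shared client thread pool (an assumption of this sketch).
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    try {
      Future<String> lookup = pool.submit(() -> {
        // The first call is assumed to have returned an unknown location,
        // so the fallback call is issued on the same pool ...
        Future<String> fallback = pool.submit(() -> "worker-host:8100");
        // ... and awaited while still occupying a pool thread.
        return fallback.get();
      });
      lookup.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return false;
    }
    catch (TimeoutException e) {
      return true;
    }
    finally {
      // Interrupt any thread still blocked in fallback.get().
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception
  {
    // One thread: the fallback can never be scheduled.
    System.out.println(lookupStarves(1, 500)); // prints "true"
    // Two threads: a spare thread runs the fallback.
    System.out.println(lookupStarves(2, 500)); // prints "false"
  }
}
```

With a larger pool the same thing happens once the number of concurrent lookups reaches the pool size, which matches the "enough MSQ workers" condition above.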
   
   ### Fix
   - Invoke the single task status API asynchronously.
   - Ensure that success/failure callbacks are quick and do not keep threads 
blocked.
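
The general shape of this fix can be sketched as follows. This is an illustration under stated assumptions rather than the actual change in this PR: it uses `java.util.concurrent.CompletableFuture` from the JDK instead of Druid's own client and future classes, and the names `AsyncLocationFallback`, `fetchFromMultiTaskApi`, `fetchFromSingleTaskApi`, and `locate` are hypothetical. The point is that the callback only chains the fallback future and returns immediately, so no pool thread waits on another.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncLocationFallback
{
  static final String UNKNOWN = "unknown";

  // Hypothetical stand-in for the multi-task status API on an old Overlord:
  // it always reports an unknown location.
  static CompletableFuture<String> fetchFromMultiTaskApi(String taskId, ExecutorService pool)
  {
    return CompletableFuture.supplyAsync(() -> UNKNOWN, pool);
  }

  // Hypothetical stand-in for the single task status API.
  static CompletableFuture<String> fetchFromSingleTaskApi(String taskId, ExecutorService pool)
  {
    return CompletableFuture.supplyAsync(() -> "host-for-" + taskId + ":8100", pool);
  }

  // The callback passed to thenCompose never blocks: it either returns the
  // known location or hands back the future of the fallback call.
  static CompletableFuture<String> locate(String taskId, ExecutorService pool)
  {
    return fetchFromMultiTaskApi(taskId, pool).thenCompose(
        location -> UNKNOWN.equals(location)
                    ? fetchFromSingleTaskApi(taskId, pool)
                    : CompletableFuture.completedFuture(location)
    );
  }

  public static void main(String[] args) throws Exception
  {
    // A single thread is enough because no thread waits inside a callback.
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      System.out.println(locate("task-1", pool).get(5, TimeUnit.SECONDS));
      // prints "host-for-task-1:8100"
    }
    finally {
      pool.shutdown();
    }
  }
}
```

Because the fallback is composed instead of awaited, the lookup completes even on a single-threaded pool, the case in which a blocking callback would hang.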
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

