kfaraz opened a new pull request, #16462:
URL: https://github.com/apache/druid/pull/16462
### Description
The task status API was improved in #15724 to serve task statuses from
Overlord memory.
However, on older Overlord versions, this API returns an `unknown` task
location, which caused failures in inter-task communication.
This was fixed in #16227, but that fix introduced another bug, described
below. The bug is typically reproducible in MSQ controller tasks but may occur
in native batch ingestion as well.
- The `SpecificTaskServiceLocator` first calls the multi-task status API
`/druid/indexer/v1/taskStatus` to determine the location of a task.
- This API returns an unknown task location on old Overlord versions.
- The `SpecificTaskServiceLocator` then falls back to calling the single
task status API `/druid/indexer/v1/task/{taskId}/status`.
- The problem is that the second API is invoked synchronously in the
callback of the first future, while holding a lock.
- If there are enough MSQ workers, all the `ServiceClientFactory` threads
can end up blocked waiting for a task location, while fetching that location
itself requires a free `ServiceClientFactory` thread.
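
The starvation described above can be reproduced in isolation. The following is a minimal sketch, not Druid code: a plain JDK fixed thread pool stands in for the `ServiceClientFactory` threads, the class name `BlockingCallbackStarvation` and the `host-for-...` values are hypothetical, and the lock mentioned above is left out for brevity. Each lookup submits its fallback call to the same pool and then waits for it synchronously.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; the pool is a stand-in for the shared client threads.
class BlockingCallbackStarvation {

    // Returns how many location lookups finished before the timeout expired.
    static int runBlocking(int poolSize, int workers, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Makes the demo deterministic: lookups proceed only once as many of
        // them as can run concurrently have each claimed a pool thread.
        CountDownLatch allStarted = new CountDownLatch(Math.min(poolSize, workers));
        try {
            List<Future<String>> lookups = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                final String taskId = "task-" + i;
                Future<String> lookup = pool.submit(() -> {
                    allStarted.countDown();
                    allStarted.await();
                    // Fallback call goes to the SAME pool...
                    Future<String> fallback = pool.submit(() -> "host-for-" + taskId);
                    // ...and this thread blocks until it completes.
                    return fallback.get();
                });
                lookups.add(lookup);
            }
            int completed = 0;
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            for (Future<String> lookup : lookups) {
                long remaining = Math.max(deadline - System.nanoTime(), 0L);
                try {
                    lookup.get(remaining, TimeUnit.NANOSECONDS);
                    completed++;
                } catch (ExecutionException | TimeoutException e) {
                    // Lookup failed or is still stuck waiting for a free thread.
                }
            }
            return completed;
        } finally {
            pool.shutdownNow(); // interrupts any threads that are still blocked
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("pool=2, workers=2: " + runBlocking(2, 2, 500));
        System.out.println("pool=4, workers=2: " + runBlocking(4, 2, 2000));
    }
}
```

With as many concurrent lookups as pool threads, every thread is blocked in `fallback.get()` and the queued fallback calls never run, so no lookup should complete. With spare threads in the pool, the fallbacks get a thread and all lookups finish.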
### Fix
- Invoke the single task status API asynchronously.
- Ensure that success/failure callbacks return quickly and do not keep threads
blocked.
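
As an illustration of the non-blocking approach, here is a minimal sketch that chains the fallback call instead of waiting for it. It is not the code from this PR: it uses the JDK's `CompletableFuture` rather than whatever future type the Druid client uses, and the names `AsyncFallbackLookup`, `locate`, `fallbackIfUnknown` and the `host-for-...` values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; the pool is a stand-in for the shared client threads.
class AsyncFallbackLookup {
    static final String UNKNOWN = "unknown";

    // Stand-in for the single task status call; runs on the pool, blocks nobody.
    static CompletableFuture<String> fallbackIfUnknown(
            String location, String taskId, ExecutorService pool) {
        if (UNKNOWN.equals(location)) {
            return CompletableFuture.supplyAsync(() -> "host-for-" + taskId, pool);
        }
        return CompletableFuture.completedFuture(location);
    }

    static CompletableFuture<String> locate(String taskId, ExecutorService pool) {
        return CompletableFuture
            // Stand-in for the multi-task status call on an old Overlord.
            .supplyAsync(() -> UNKNOWN, pool)
            // The callback only schedules the fallback and returns at once.
            .thenCompose(location -> fallbackIfUnknown(location, taskId, pool));
    }

    static List<String> locateAll(int poolSize, int workers, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < workers; i++) {
                pending.add(locate("task-" + i, pool));
            }
            List<String> locations = new ArrayList<>();
            for (CompletableFuture<String> lookup : pending) {
                // Only the caller waits here; no pool thread ever blocks.
                locations.add(lookup.get(timeoutMillis, TimeUnit.MILLISECONDS));
            }
            return locations;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(locateAll(1, 8, 2000));
    }
}
```

Because no pool thread waits on another pool task, the lookups can complete even when the number of workers exceeds the number of threads, including with a single-threaded pool.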