jasonk000 opened a new pull request #12099:
URL: https://github.com/apache/druid/pull/12099
### Description
Improves the stability of Overlord and all tasks in a cluster when there are
large (1000+) task counts, by reducing contention between the management thread
and the reception of status updates from the cluster.
#### Introduce GuardedBy to TaskQueue
.. and fix some existing missed spots.
#### Introduce TaskQueueScaleTest
To test scalability of starting and stopping 1000 tasks (set with a 60sec
timeout), that currently fails and is fixed in the next commit.
#### Reduce TaskQueue contention
Reduce the duration of holding the `giant` critical lock, which increases
responsiveness:
- Break apart the TaskQueue-Manager manage loop to a critical (locked)
section and a section that can be run concurrently with notifications (ie:
sending any necessary shutdown requests).
- This is the most important part of the change, since in the existing
code the blocking shutdown requests are performed inside the loop. By moving
the blocking calls outside the loop we make it possible for status
notifications to be promptly processed.
- Minimise duration that notifyStatus calls take holding the giant lock.
- Move other logging etc outside the critical section where possible.
### Design decisions
I chose a BlockingQueue implementation because it is easy to reason about
the submission / poll / offer ordering. Other options would be Semaphore etc.
There is potential future work:
- Current behaviour, if a task `shutdown()` call is slow it slows down
submission of tasks across the whole loop - this is not improved by the PR.
- This could be mitigated by introducing an Executor, or by applying a
decision that shutdown() call implementations are non-blocking.
- If recommended, I'd suggest we do this as a separate PR.
This follows the mailing list discussions here:
https://lists.apache.org/thread/9jgdwrodwsfcg98so6kzfhdmn95gzyrj
This PR has:
- [x] been self-reviewed.
- [x] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [x] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [x] been tested in a test Druid cluster. (as part of another block of
changes).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]