jasonk000 opened a new pull request #12099:
URL: https://github.com/apache/druid/pull/12099


   ### Description
   
   Improves the stability of Overlord and all tasks in a cluster when there are 
large (1000+) task counts, by reducing contention between the management thread 
and the reception of status updates from the cluster.
   
   #### Introduce GuardedBy to TaskQueue
   
   .. and fix some existing missed spots.
   
   #### Introduce TaskQueueScaleTest
   
   To test scalability of starting and stopping 1000 tasks (set with a 60sec 
timeout), that currently fails and is fixed in the next commit.
   
   #### Reduce TaskQueue contention
   
   Reduce the duration of holding the `giant` critical lock, which increases 
responsiveness:
   - Break apart the TaskQueue-Manager manage loop to a critical (locked) 
section and a section that can be run concurrently with notifications (ie: 
sending any necessary shutdown requests).
     - This is the most important part of the change, since in the existing 
code the blocking shutdown requests are performed inside the loop. By moving 
the blocking calls outside the loop we make it possible for status 
notifications to be promptly processed.
   - Minimise duration that notifyStatus calls take holding the giant lock.
   - Move other logging etc outside the critical section where possible.
   
   ### Design decisions
   
   I chose a BlockingQueue implementation because it is easy to reason about 
the submission / poll / offer ordering. Other options would be Semaphore etc.
   
   There is potential future work:
   - Current behaviour, if a task `shutdown()` call is slow it slows down 
submission of tasks across the whole loop - this is not improved by the PR.
   - This could be mitigated by introducing an Executor, or by applying a 
decision that shutdown() call implementations are non-blocking.
   - If recommended, I'd suggest we do this as a separate PR.
   
   This follows the mailing list discussions here:
   https://lists.apache.org/thread/9jgdwrodwsfcg98so6kzfhdmn95gzyrj
   
   This PR has:
   - [x] been self-reviewed.
      - [x] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster. (as part of another block of 
changes).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to