This is an automatically generated e-mail. To reply, visit:
Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
This is the first (out of 3) patches intending to reduce storage write lock
contention and as such improve overall system write throughput. It introduces
the `BatchWorker` and migrates the majority of storage writes due to task
status change events to use `TaskEventBatchWorker`.
Our current storage system writes effectively behave as `SERIALIZABLE`
transaction isolation level in SQL terms. This means all writes require
exclusive access to the storage and no two transactions can happen in parallel
. While it certainly simplifies our implementation, it creates a single
hotspot where multiple threads are competing for the storage write access. This
type of contention only worsens as the cluster size grows, more tasks are
scheduled, more status updates are processed, more subscribers are listening to
status updates and etc. Eventually, the scheduler throughput (and especially
task scheduling) becomes degraded to the extent that certain operations wait
much longer (4x and more) for the lock acquisition than it takes to process
their payload when inside the transaction. Some ops (like event processing) are
generally tolerant of these types of delays. Others - not as much. The task
scheduling suffers the most as backing up the scheduling queue directly affects
he Median Time To Assigned (MTTA).
Given the above, it's natural to assume that reducing the number of write
transactions should help reducing the lock contention. This patch introduces a
generic `BatchWorker` service that delivers a "best effort" batching approach
by redirecting multiple individual write requests into a single FIFO queue
served non-stop by a single dedicated thread. Every batch shares a single write
transaction thus reducing the number of potential write lock requests. To
minimize wait-in-queue time, items are dispatched immediately and the max
number of items is bounded. There are a few `BatchWorker` instances specialized
on particular workload types: task even processing, cron scheduling and task
scheduling. Every instance can be tuned independently (max batch size) and
provides specialized metrics helping to monitor each workload type perf.
The proposed approach has been heavily tested in production and delivered the
best results. The lock contention latencies got down between 2x and 5x
depending on the cluster load. A number of other approaches tried but discarded
as not performing well or even performing much worse than the current master:
- Clock-driven batch execution - every batch is dispatched on a time schedule
- Max batch with a deadline - a batch is dispatched when max size is reached OR
a timeout expires
- Various combinations of the above - some `BatchWorkers` are using
clock-driven execution while others are using max batch with a deadline
- Completely non-blocking (event-based) completion notification - all call
sites are notified of item completion via a `BatchWorkCompleted` event
Happy to provide more details on the above if interested.
The introduction of the `BatchWorker` by itself was not enough to substantially
improve the MTTA. It, however, paves the way for the next phase of scheduling
perf improvement - taking more than 1 task from a given `TaskGroup` in a single
scheduling round (coming soon). That improvement wouldn't deliver without
decreasing the lock contention first.
Note: it wasn't easy to have a clean diff split, so some functionality in
`BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current
patch but will become obvious in the part 2 (coming out shortly).
All types of testing including deploying to test and production clusters.