> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > This patch LGTM. Just a single concern here and some questions. I will be
> > moving on to the subsequent patches shortly.
> > Please do not commit this until all parts are shipped. Also, please don't
> > forget to rebase/update subsequent patches if changes are made here.
> > Also, does this cover all state change consumers that interact on storage
> > when the event is fired or is this a subset? Just wondering if this is
> > supposed to apply to all of them or just a small set that you think need
> > this change.
> Maxim Khutornenko wrote:
> > Also, does this cover all state change consumers that interact on
> storage when the event is fired or is this a subset
> This covers all known high frequency consumers that require a write
> transaction. The list is derived by running a stack-dumping version of the
> `CallOrderEnforcingStorage.write()` in a live cluster for awhile and
> analysing top offenders. It may be not fully comprehensive though, so if you
> find any other candidates feel free to report them.
> Zameer Manji wrote:
> I think some comments at the consumers to say that batch worker is
> required for high throughput would be greatly appreciated for future
I'd rather avoid comment copypasta everywhere we use `BatchWorker`. I thought
it's enough to trace the execution from the callsite to the `BatchWorker`
itself to see how/why it's used, no?
This is an automatically generated e-mail. To reply, visit:
On Sept. 14, 2016, 10:41 p.m., Maxim Khutornenko wrote:
> This is an automatically generated e-mail. To reply, visit:
> (Updated Sept. 14, 2016, 10:41 p.m.)
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> Repository: aurora
> This is the first (out of 3) patches intending to reduce storage write lock
> contention and as such improve overall system write throughput. It introduces
> the `BatchWorker` and migrates the majority of storage writes due to task
> status change events to use `TaskEventBatchWorker`.
> Our current storage system writes effectively behave as `SERIALIZABLE`
> transaction isolation level in SQL terms. This means all writes require
> exclusive access to the storage and no two transactions can happen in
> parallel . While it certainly simplifies our implementation, it creates a
> single hotspot where multiple threads are competing for the storage write
> access. This type of contention only worsens as the cluster size grows, more
> tasks are scheduled, more status updates are processed, more subscribers are
> listening to status updates and etc. Eventually, the scheduler throughput
> (and especially task scheduling) becomes degraded to the extent that certain
> operations wait much longer (4x and more) for the lock acquisition than it
> takes to process their payload when inside the transaction. Some ops (like
> event processing) are generally tolerant of these types of delays. Others -
> not as much. The task scheduling suffers the most as backing up the
> scheduling queue directly affects
the Median Time To Assigned (MTTA).
> Given the above, it's natural to assume that reducing the number of write
> transactions should help reducing the lock contention. This patch introduces
> a generic `BatchWorker` service that delivers a "best effort" batching
> approach by redirecting multiple individual write requests into a single FIFO
> queue served non-stop by a single dedicated thread. Every batch shares a
> single write transaction thus reducing the number of potential write lock
> requests. To minimize wait-in-queue time, items are dispatched immediately
> and the max number of items is bounded. There are a few `BatchWorker`
> instances specialized on particular workload types: task even processing,
> cron scheduling and task scheduling. Every instance can be tuned
> independently (max batch size) and provides specialized metrics helping to
> monitor each workload type perf.
> The proposed approach has been heavily tested in production and delivered the
> best results. The lock contention latencies got down between 2x and 5x
> depending on the cluster load. A number of other approaches tried but
> discarded as not performing well or even performing much worse than the
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call
> sites are notified of item completion via a `BatchWorkCompleted` event
> Happy to provide more details on the above if interested.
> The introduction of the `BatchWorker` by itself was not enough to
> substantially improve the MTTA. It, however, paves the way for the next phase
> of scheduling perf improvement - taking more than 1 task from a given
> `TaskGroup` in a single scheduling round (coming soon). That improvement
> wouldn't deliver without decreasing the lock contention first.
> Note: it wasn't easy to have a clean diff split, so some functionality in
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current
> patch but will become obvious in the part 2 (coming out shortly).
>  -
> src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION
> src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java PRE-CREATION
> Diff: https://reviews.apache.org/r/51759/diff/
> All types of testing including deploying to test and production clusters.
> Maxim Khutornenko