This is an automatically generated e-mail. To reply, visit:
I noticed that `CompletableFuture` might be suitable here. I will hold off a
deeper review until I'm convinced that it is not suitable. I feel if it were
possible to adopt and it was adopted here it could simplify the design.
Overall, I like the approach of queueing up multiple writes together and then
executing them together inside the same storage lock. I think the interface is
suitable for this purpose.
src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 105)
I agree we cannot use `java.util.concurrent.FutureTask` here. However, this
seems to be pretty close to `java.util.concurrent.CompletableFuture<T>` (new in
It's not clear to me here if you have investigated it or not, but if you
have I would like a brief explanation as to why it is not suitable. If you
haven't please take a look at it as it seems to provide all of the funtionality
Further if it is possible to use it, then we do not need pubsub for
notification for when work completes, if item is it's own future, we can use
the `CompletionStage` functionality to execute code after a future has been
It has a `get()` method that can replace `waitForResult()` and a
`complete()` method that allows you to set the value.
src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 188)
Our current metrics use 'nanos' to indicate nanoseconds instead of 'ns'.
Mind using that here for consistency?
- Zameer Manji
On Sept. 12, 2016, 3:51 p.m., Maxim Khutornenko wrote:
> This is an automatically generated e-mail. To reply, visit:
> (Updated Sept. 12, 2016, 3:51 p.m.)
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> Repository: aurora
> This is the first (out of 3) patches intending to reduce storage write lock
> contention and as such improve overall system write throughput. It introduces
> the `BatchWorker` and migrates the majority of storage writes due to task
> status change events to use `TaskEventBatchWorker`.
> Our current storage system writes effectively behave as `SERIALIZABLE`
> transaction isolation level in SQL terms. This means all writes require
> exclusive access to the storage and no two transactions can happen in
> parallel . While it certainly simplifies our implementation, it creates a
> single hotspot where multiple threads are competing for the storage write
> access. This type of contention only worsens as the cluster size grows, more
> tasks are scheduled, more status updates are processed, more subscribers are
> listening to status updates and etc. Eventually, the scheduler throughput
> (and especially task scheduling) becomes degraded to the extent that certain
> operations wait much longer (4x and more) for the lock acquisition than it
> takes to process their payload when inside the transaction. Some ops (like
> event processing) are generally tolerant of these types of delays. Others -
> not as much. The task scheduling suffers the most as backing up the
> scheduling queue directly affects
the Median Time To Assigned (MTTA).
> Given the above, it's natural to assume that reducing the number of write
> transactions should help reducing the lock contention. This patch introduces
> a generic `BatchWorker` service that delivers a "best effort" batching
> approach by redirecting multiple individual write requests into a single FIFO
> queue served non-stop by a single dedicated thread. Every batch shares a
> single write transaction thus reducing the number of potential write lock
> requests. To minimize wait-in-queue time, items are dispatched immediately
> and the max number of items is bounded. There are a few `BatchWorker`
> instances specialized on particular workload types: task even processing,
> cron scheduling and task scheduling. Every instance can be tuned
> independently (max batch size) and provides specialized metrics helping to
> monitor each workload type perf.
> The proposed approach has been heavily tested in production and delivered the
> best results. The lock contention latencies got down between 2x and 5x
> depending on the cluster load. A number of other approaches tried but
> discarded as not performing well or even performing much worse than the
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call
> sites are notified of item completion via a `BatchWorkCompleted` event
> Happy to provide more details on the above if interested.
> The introduction of the `BatchWorker` by itself was not enough to
> substantially improve the MTTA. It, however, paves the way for the next phase
> of scheduling perf improvement - taking more than 1 task from a given
> `TaskGroup` in a single scheduling round (coming soon). That improvement
> wouldn't deliver without decreasing the lock contention first.
> Note: it wasn't easy to have a clean diff split, so some functionality in
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current
> patch but will become obvious in the part 2 (coming out shortly).
>  -
> src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION
> src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java PRE-CREATION
> Diff: https://reviews.apache.org/r/51759/diff/
> All types of testing including deploying to test and production clusters.
> Maxim Khutornenko