> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > Thanks for the excellent write-up and the time you took to split the review 
> > into sizable chunks. A full review will take some time. I will start with a 
> > couple of high-level questions.
> > 
> > a) Do you have some results of the JMH benchmarks that you might share?
> > 
> > b) If I understand this correctly, you are basically trading one lock for 
> > another: instead of waiting on the global storage lock, threads now have 
> > to wait on the lock of the `workQueue`. Obviously, waiting on that lock is 
> > much faster as there is less work being done in the critical region 
> > protected by that lock. However, isn't that a strategy that we could try to 
> > use as well? For details, see my comment in Part 3.

> a) Do you have some results of the JMH benchmarks that you might share?

Good question. Creating a reliable benchmark that simulates real-world lock 
contention is next to impossible. The JMH guidelines warn that benchmarking 
accuracy degrades quickly with the number of threads involved. The best we 
can prove here is that the overhead introduced by the `BatchWorker` is not 
prohibitive:

Master:
```
Benchmark                                                                      Mode  Cnt     Score    Error  Units
SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2714.177 ± 84.900  ops/s
```

Part 3 (https://reviews.apache.org/r/51765/):
```
Benchmark                                                                      Mode  Cnt     Score    Error  Units
SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2664.389 ± 65.013  ops/s
```

The results are well within each other's margin of error, and both are a few 
orders of magnitude higher than the scheduling throughput that is possible or 
observed in practice.

> b) If I understand this correctly, you are basically trading one lock for 
> another...

The major benefit here is improving the response time by reducing the hotspot 
contention. According to Little's Law [1]:
```
MeanResponseTime = MeanNumberInSystem / MeanThroughput
```

Given the fixed throughput (native log perf), the only way to reduce queue 
saturation (and thereby improve response time) is to reduce the number of 
outstanding requests. Batching multiple individual requests accomplishes 
exactly that goal. It's worth noting that the above formula applies when all 
requests take about the same time to process. That is *not* the case in our 
system: some requests take longer (task scheduling), while others are very 
low latency (single task status processing). Varying the batch size for 
different types of requests (a few scheduling requests vs. hundreds of task 
events) lets us level out the request shape (as much as is physically 
possible) and thereby improve overall perf.
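
To make the effect concrete, here is a back-of-the-envelope illustration with 
invented numbers (assuming, per the above, that lock/transaction overhead 
dominates payload cost):
```
Unbatched: 200 queued writes, ~100 transactions/sec (1 write each)
           -> MeanResponseTime = 200 / 100 = 2.0 sec

Batched:   the same 200 writes queued as 20 batches of 10; still
           ~100 transactions/sec since lock overhead dominates
           -> MeanResponseTime = 20 / 100 = 0.2 sec
```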

As for reducing the critical section weight, I'll address that in the Part 3 
comment. In general, though, reducing the latency helps level out the request 
shape but is still not sufficient to reduce the hotspot (see above).

[1] - https://en.wikipedia.org/wiki/Little%27s_law


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 58
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line58>
> >
> >     I am wondering if this queue shouldn't have a fixed capacity, so that 
> > we gain some kind of back pressure throughout the system.

Not sure I follow. The default `LinkedBlockingQueue` constructor uses 
`Integer.MAX_VALUE` for the max number of items in the queue. That should be 
more than sufficient for our use case.
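
For reference, a minimal sketch of the distinction under discussion (standard 
`java.util.concurrent` behavior, not code from this patch):
```java
import java.util.concurrent.LinkedBlockingQueue;

public class QueueCapacityDemo {
  public static void main(String[] args) throws InterruptedException {
    // Default constructor: capacity is Integer.MAX_VALUE, i.e. effectively
    // unbounded, so producers are never throttled.
    LinkedBlockingQueue<Runnable> unbounded = new LinkedBlockingQueue<>();
    unbounded.put(() -> { }); // always succeeds (until memory runs out)

    // Bounded alternative: put() blocks once capacity is reached, which is
    // what would provide back pressure to producers.
    LinkedBlockingQueue<Runnable> bounded = new LinkedBlockingQueue<>(2);
    bounded.put(() -> { });
    bounded.put(() -> { });
    // A third put() would now block until a consumer takes an item.
    System.out.println("bounded queue full: " + (bounded.remainingCapacity() == 0));
  }
}
```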


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 164
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line164>
> >
> >     `Work` is a subclass of `RepeatableWork` but meant to express something 
> > non-repeatable. This sounds like a violation of the [Liskov Substitution 
> > Principle](https://en.wikipedia.org/wiki/Liskov_substitution_principle). 
> >     
> >     In other words, I feel like `RepeatableWork` should extend `Work` and 
> > not the other way round.

LSP is a principle applicable to mutable types; these are interfaces. 
`RepeatableWork` is the broader type that supports repetition, while `Work` 
is a convenience wrapper limited to a single repetition. We use a similar 
approach in `Storage`: see `StorageOperation` and the multitude of derived 
interfaces (`Quiet`, `NoResult`, etc.).
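
To sketch the relationship (hypothetical shapes for illustration, not the 
actual signatures from the patch):
```java
// Hypothetical stand-ins used only for this sketch.
interface StoreProvider { }

final class Result<T> {
  final boolean repeat; // should the item be re-queued for another pass?
  final T value;

  Result(boolean repeat, T value) {
    this.repeat = repeat;
    this.value = value;
  }
}

// The broader contract: each invocation reports whether another pass is needed.
interface RepeatableWork<T> {
  Result<T> apply(StoreProvider store);
}

// The one-shot convenience: fixes the "repeat?" answer to false.
interface Work<T> extends RepeatableWork<T> {
  T execute(StoreProvider store);

  @Override
  default Result<T> apply(StoreProvider store) {
    return new Result<>(false, execute(store));
  }
}
```
Under this shaping, every `Work` is trivially usable wherever a 
`RepeatableWork` is expected, which is the substitution direction the batch 
queue relies on.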


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 240-247
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line240>
> >
> >     A queue could hold vital information that we should not lose during 
> > scheduler failover (in particular, as we do this once per day).
> >     
> >     Would it be possible to drain the queue completely before exiting here?

This is no different from terminating the main process while threads are 
waiting on a write lock. Transaction boundaries and reconciliation on restart 
are what ensure data consistency.


> On Sept. 11, 2016, 11:02 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 257-259
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line257>
> >
> >     1) Doesn't that effectively result in a busy loop where we are queuing 
> > the same item over and over until it can finally be executed? This 
> > sounds rather problematic.
> >     
> >     2) This changes the order of work items. Are you sure that no storage 
> > client requires writes to be in order?

> 1) Doesn't that effectively result in a busy loop where we are queuing the 
> same item over and over until it can finally be executed? This sounds 
> rather problematic.

There is no busy loop in the pure sense, as every `RepeatableWork` item must 
have a backoff strategy associated with it. Any item scheduled for 
re-processing is delayed according to the supplied backoff.
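
A minimal sketch of that shape (the `BackoffStrategy` interface below is a 
stand-in for illustration, not the patch's actual type):
```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BackoffRequeueDemo {
  // Stand-in for the per-item backoff strategy supplied with RepeatableWork.
  interface BackoffStrategy {
    long nextDelayMs(long previousDelayMs);
  }

  private static final ScheduledExecutorService SCHEDULER =
      Executors.newSingleThreadScheduledExecutor();

  // Instead of re-offering an incomplete item immediately (a busy loop),
  // schedule the re-offer after the strategy-provided delay.
  static void requeueLater(Runnable reOffer, BackoffStrategy backoff, long previousDelayMs) {
    SCHEDULER.schedule(reOffer, backoff.nextDelayMs(previousDelayMs), TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) {
    BackoffStrategy doubling = prev -> Math.max(10, prev * 2); // 10 ms floor, doubling
    requeueLater(() -> System.out.println("re-queued after backoff"), doubling, 0);
    SCHEDULER.shutdown(); // already-scheduled delayed tasks still run by default
  }
}
```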

> 2) This changes the order of work items. Are you sure that no storage client 
> requires writes to be in order?

Order has never been implied or expected across transaction boundaries. That 
said, the order is preserved for a given work type: we re-queue incomplete 
items in the same order they were received.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148433
-----------------------------------------------------------


On Sept. 9, 2016, 5:29 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> -----------------------------------------------------------
> 
> (Updated Sept. 9, 2016, 5:29 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is the first of three patches intended to reduce storage write lock 
> contention and thereby improve overall system write throughput. It 
> introduces the `BatchWorker` and migrates the majority of storage writes 
> caused by task status change events to use `TaskEventBatchWorker`.
> 
> #####Problem
> Our current storage writes effectively behave like the `SERIALIZABLE` 
> transaction isolation level in SQL terms: all writes require exclusive 
> access to the storage, and no two transactions can happen in parallel [1]. 
> While this certainly simplifies our implementation, it creates a single 
> hotspot where multiple threads compete for storage write access. This type 
> of contention only worsens as the cluster grows: more tasks are scheduled, 
> more status updates are processed, more subscribers are listening to status 
> updates, and so on. Eventually, the scheduler throughput (and especially 
> task scheduling) degrades to the point where certain operations wait much 
> longer (4x and more) for lock acquisition than it takes to process their 
> payload inside the transaction. Some ops (like event processing) are 
> generally tolerant of these types of delays. Others are not. Task 
> scheduling suffers the most, as backing up the scheduling queue directly 
> affects the Median Time To Assigned (MTTA).
> 
> #####Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single 
> FIFO queue served non-stop by a single dedicated thread. Every batch shares 
> a single write transaction, thus reducing the number of potential write 
> lock requests. To minimize wait-in-queue time, items are dispatched 
> immediately and the max number of items per batch is bounded. There are a 
> few `BatchWorker` instances specialized for particular workload types: task 
> event processing, cron scheduling, and task scheduling. Every instance can 
> be tuned independently (max batch size) and provides specialized metrics 
> that help monitor each workload type's perf.
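> 
> A hedged sketch of that core loop (illustrative only, not the actual patch 
> code; `Storage` below is a stand-in for the scheduler's storage interface):
> 
> ```java
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.LinkedBlockingQueue;
> 
> class BatchLoopSketch {
>   // Stand-in: runs the given body inside one write transaction.
>   interface Storage { void write(Runnable transactionBody); }
> 
>   private final LinkedBlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();
>   private final Storage storage;
>   private final int maxBatchSize;
> 
>   BatchLoopSketch(Storage storage, int maxBatchSize) {
>     this.storage = storage;
>     this.maxBatchSize = maxBatchSize;
>   }
> 
>   // Served non-stop by a single dedicated thread.
>   void processBatches() throws InterruptedException {
>     while (true) {
>       List<Runnable> batch = new ArrayList<>();
>       batch.add(workQueue.take());                // block until work arrives
>       workQueue.drainTo(batch, maxBatchSize - 1); // immediate dispatch, bounded size
>       // Every batch shares a single write transaction.
>       storage.write(() -> batch.forEach(Runnable::run));
>     }
>   }
> }
> ```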
> 
> #####Results
> The proposed approach has been heavily tested in production and delivered 
> the best results. The lock contention latencies went down by 2x to 5x 
> depending on the cluster load. A number of other approaches were tried but 
> discarded as not performing well, or even performing much worse than the 
> current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #####Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It does, however, pave the way for the next 
> phase of scheduling perf improvement: taking more than one task from a 
> given `TaskGroup` in a single scheduling round (coming soon). That 
> improvement wouldn't deliver without first decreasing the lock contention. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the 
> current patch; its purpose will become obvious in part 2 (coming out shortly).
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
> 4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
>   src/main/java/org/apache/aurora/scheduler/events/PubsubEvent.java 
> 70b5470b9dad1af838b5222cae5ac86487e2f2e4 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java 
> bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
>   src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 
> 3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  594bb6219294dcc77d48dcad14e2a6f9caa0c534 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java PRE-CREATION 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 99c27e8012f10a67ce5f1b84d258e7a5608995c7 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskThrottlerTest.java 
> 7d104aa2ea4a4d99be4711f666d18beca238284e 
>   
> src/test/java/org/apache/aurora/scheduler/state/MaintenanceControllerImplTest.java
>  94f5ca565476f62d72879837a0e7dafabcf30432 
>   src/test/java/org/apache/aurora/scheduler/testing/BatchWorkerUtil.java 
> PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 196df4754b553f05e50b66ad2f84271901bc9eba 
> 
> Diff: https://reviews.apache.org/r/51759/diff/
> 
> 
> Testing
> -------
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>
