> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > Thanks for the excellent write-up and the time you took to split the review
> > into sizable chunks. A full review will take some time. I will start with a
> > couple of high-level questions.
> > 
> > a) Do you have some results of the JMH benchmarks that you might share?
> > 
> > b) If I understand this correctly, you are basically trading one lock for
> > another: instead of waiting on the global storage lock, threads now have to
> > wait on the lock of the `workQueue`. Obviously, waiting on that lock is much
> > faster since there is less work being done in the critical region protected
> > by that lock. However, isn't that a strategy that we could try to use as
> > well? For details, see my comment in Part 3.
> 
> Maxim Khutornenko wrote:
>     > a) Do you have some results of the JMH benchmarks that you might share?
>     
>     Good question. Creating a reliable benchmark to simulate real-world lock
>     contention is next to impossible. The JMH guidelines warn that the
>     accuracy of benchmarking degrades quickly with the number of threads
>     involved. The best we could prove here is that the overhead introduced by
>     the `BatchWorker` is not prohibitive:
>     
>     Master:
>     ```
>     Benchmark                                                                      Mode  Cnt     Score    Error  Units
>     SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2714.177 ± 84.900  ops/s
>     ```
>     
>     Part 3 (https://reviews.apache.org/r/51765/):
>     ```
>     Benchmark                                                                      Mode  Cnt     Score    Error  Units
>     SchedulingBenchmarks.LimitConstraintMismatchSchedulingBenchmark.runBenchmark  thrpt   10  2664.389 ± 65.013  ops/s
>     ```
>     
>     The results are well within the margin of error, and both are a few
>     orders of magnitude higher than the real scheduling perf possible/observed.
>     
>     > b) If I understand this correctly, you are basically trading one lock
>     > for another...
>     
>     The major benefit here is improving the response time by reducing the 
> hotspot contention. According to Little's Law [1]:
>     ```
>     MeanResponseTime = MeanNumberInSystem / MeanThroughput
>     ```
>     
>     Given the fixed throughput (native log perf), the only way to reduce
>     queue saturation (and thus improve response time) is to reduce the number
>     of outstanding requests. Batching multiple individual requests
>     accomplishes exactly that goal. It's worth noting that the above formula
>     applies when all requests take about the same time to process. That is
>     *not* the case in our system: some requests take longer (task
>     scheduling), others are very low latency (single task status processing).
>     Varying the batch size of different types of requests (a few scheduling
>     requests vs. hundreds of task events) lets us level out the request shape
>     (as much as physically possible) and thus improve overall perf.
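>     
>     Illustrative figures (made up for this reply, not measured): suppose the
>     native log sustains a fixed 100 write transactions/sec and 200 individual
>     writes are outstanding. Batching them 20 per transaction leaves only 10
>     outstanding transactions:
>     ```
>     MeanResponseTime = 200 / 100 tx/s = 2000 ms   (individual writes)
>     MeanResponseTime =  10 / 100 tx/s =  100 ms   (batched, 20 writes/tx)
>     ```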
>     
>     As for reducing the critical section weight, I'll address that in the
>     Part 3 comment. In general, though, reducing the latency helps level out
>     the request shape but is still not sufficient to reduce the hotspot (see
>     above).
>     
>     [1] - https://en.wikipedia.org/wiki/Little%27s_law

Oh, I wasn't aware that the native log limits your scheduling throughput. My
uninformed guess was that you'd spend most of the time looping over those
O(10000) host offers trying to find a suitable one, rather than appending to
the native log.

Batching requests then makes sense to me.


> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 58
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line58>
> >
> >     I am wondering if this queue shouldn't have a fixed capacity, so that 
> > we gain some kind of back pressure throughout the system.
> 
> Maxim Khutornenko wrote:
>     Not sure I follow. The default `LinkedBlockingQueue` constructor uses 
> `Integer.MAX_VALUE` for the max number of items in the queue. That should be 
> more than sufficient for our use case.
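>     
>     For reference, the difference is just the constructor argument
>     (illustrative snippet, not code from this patch); a bounded queue would
>     make `put()` block when full, giving producers backpressure:
>     ```java
>     import java.util.concurrent.LinkedBlockingQueue;
>     
>     class QueueCapacityDemo {
>       public static void main(String[] args) throws InterruptedException {
>         // Default constructor: effectively unbounded (capacity = Integer.MAX_VALUE).
>         LinkedBlockingQueue<Runnable> unbounded = new LinkedBlockingQueue<>();
>     
>         // Bounded alternative: put() blocks once 10_000 items are queued,
>         // which would propagate backpressure to producers.
>         LinkedBlockingQueue<Runnable> bounded = new LinkedBlockingQueue<>(10_000);
>     
>         unbounded.offer(() -> { });  // never blocks in practice
>         bounded.put(() -> { });      // blocks only while the queue is full
>       }
>     }
>     ```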

Scratch that comment of mine. I feared that this queue could grow without
bounds, but this is probably nothing to be concerned about.


> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 240-247
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line240>
> >
> >     A queue could hold vital information that we should not lose during
> >     scheduler failover (in particular, as we do this once per day).
> >     
> >     Would it be possible to drain the queue completely before exiting here?
> 
> Maxim Khutornenko wrote:
>     This is no different than terminating the main process while threads are
>     waiting on a write lock. Transaction boundaries and reconciliation on
>     restart are what ensure data consistency.

Good point. I have just noticed that we also drop updates in
`TaskStatusHandlerImpl`, which follows the same batched-write pattern
implemented here.

Might be nice to have a graceful shutdown mechanism eventually, but that would
definitely be out of scope for this review.


> On Sept. 12, 2016, 1:02 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, lines 257-259
> > <https://reviews.apache.org/r/51759/diff/1/?file=1495031#file1495031line257>
> >
> >     1) Doesn't that effectively result in a busy loop where we are queuing
> >     the same item over and over until it can finally be executed? This
> >     sounds rather problematic.
> >     
> >     2) This changes the order of work items. Are you sure that no storage
> >     client requires writes to be in order?
> 
> Maxim Khutornenko wrote:
>     > 1) Doesn't that effectively result in a busy loop where we are queuing
>     > the same item over and over until it can finally be executed? This
>     > sounds rather problematic.
>     
>     There is no busy loop in the pure sense, as every `RepeatableWork` item
>     must have a backoff strategy associated with it. Any item scheduled for
>     re-processing is delayed according to the supplied backoff.
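>     
>     A rough sketch of the mechanism (hypothetical names, not the actual
>     `BatchWorker` internals): an incomplete item gets an exponentially
>     growing "not before" timestamp when re-queued, so the drain loop skips it
>     rather than retrying it in a tight loop.
>     ```java
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.LinkedBlockingQueue;
>     
>     class BackoffRequeueSketch {
>       static class WorkItem {
>         long backoffMs = 0;    // last applied backoff
>         long notBeforeMs = 0;  // earliest time the item may be retried
>     
>         // Capped binary backoff: 1, 2, 4, ... up to 1000 ms between retries.
>         void scheduleRetry(long nowMs) {
>           backoffMs = Math.min(Math.max(1, backoffMs * 2), 1_000);
>           notBeforeMs = nowMs + backoffMs;
>         }
>       }
>     
>       public static void main(String[] args) {
>         BlockingQueue<WorkItem> workQueue = new LinkedBlockingQueue<>();
>         WorkItem item = new WorkItem();
>         item.scheduleRetry(System.currentTimeMillis());
>         // Re-queued, but the drain loop skips it until notBeforeMs passes.
>         workQueue.offer(item);
>       }
>     }
>     ```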
>     
>     > 2) This changes the order of work items. Are you sure that no storage
>     > client requires writes to be in order?
>     
>     Order has never been implied or expected across transaction boundaries.
>     That said, the order is preserved for a given work type: we re-queue
>     incomplete items in the same order they were received.

Regarding question 1)

It seems to me that the `run()` loop will drain a batch of `WorkItem`s from
the `workQueue`. We then call `getReadyItems` and re-queue everything that is
currently in backoff. This repeats over and over until the backoff time has
passed.

Have you considered using a `@AsyncExecutor DelayExecutor` just like in 
`TaskThrottler` for everything that is in backoff and cannot be executed 
immediately?
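
Something along these lines (a rough sketch with a plain
`ScheduledExecutorService` standing in for the `DelayExecutor`; names are
illustrative):
```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DelayedRetrySketch {
  public static void main(String[] args) {
    ScheduledExecutorService delayExecutor = Executors.newSingleThreadScheduledExecutor();
    Runnable item = () -> System.out.println("retrying work item");
    long backoffMs = 250;  // whatever the item's backoff strategy dictates

    // The work queue never sees the item until it is actually runnable,
    // avoiding repeated drain/re-queue cycles while the item is in backoff.
    delayExecutor.schedule(item, backoffMs, TimeUnit.MILLISECONDS);
    delayExecutor.shutdown();  // pending delayed tasks still run by default
  }
}
```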


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review148433
-----------------------------------------------------------


On Sept. 13, 2016, 12:51 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> -----------------------------------------------------------
> 
> (Updated Sept. 13, 2016, 12:51 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is the first of three patches intended to reduce storage write lock
> contention and thus improve overall system write throughput. It introduces
> the `BatchWorker` and migrates the majority of storage writes due to task
> status change events to use `TaskEventBatchWorker`.
> 
> #####Problem
> Our current storage writes effectively behave as the `SERIALIZABLE`
> transaction isolation level in SQL terms: all writes require exclusive access
> to the storage, and no two transactions can happen in parallel [1]. While it
> certainly simplifies our implementation, it creates a single hotspot where
> multiple threads compete for storage write access. This type of contention
> only worsens as the cluster size grows, more tasks are scheduled, more status
> updates are processed, more subscribers listen to status updates, and so on.
> Eventually, the scheduler throughput (and especially task scheduling)
> degrades to the extent that certain operations wait much longer (4x and more)
> for lock acquisition than it takes to process their payload inside the
> transaction. Some ops (like event processing) are generally tolerant of these
> types of delays. Others, not so much. Task scheduling suffers the most, as
> backing up the scheduling queue directly affects the Median Time To Assigned
> (MTTA).
> 
> #####Remediation
> Given the above, it's natural to assume that reducing the number of write
> transactions should help reduce the lock contention. This patch introduces a
> generic `BatchWorker` service that delivers a "best effort" batching approach
> by redirecting multiple individual write requests into a single FIFO queue
> served non-stop by a single dedicated thread. Every batch shares a single
> write transaction, thus reducing the number of potential write lock requests.
> To minimize wait-in-queue time, items are dispatched immediately and the max
> number of items per batch is bounded. There are a few `BatchWorker` instances
> specialized for particular workload types: task event processing, cron
> scheduling and task scheduling. Every instance can be tuned independently
> (max batch size) and provides specialized metrics to help monitor each
> workload type's perf.
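> 
> A minimal sketch of the pattern (hypothetical names; the real `BatchWorker`
> adds metrics, backoff handling and result plumbing):
> ```java
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.BlockingQueue;
> import java.util.concurrent.LinkedBlockingQueue;
> 
> class BatchWorkerSketch {
>   interface Work { void apply(); }  // stand-in for a storage mutation
> 
>   private final BlockingQueue<Work> workQueue = new LinkedBlockingQueue<>();
>   private static final int MAX_BATCH_SIZE = 32;
> 
>   void enqueue(Work item) { workQueue.offer(item); }
> 
>   void run() throws InterruptedException {
>     while (!Thread.currentThread().isInterrupted()) {
>       List<Work> batch = new ArrayList<>();
>       batch.add(workQueue.take());                    // block until work arrives
>       workQueue.drainTo(batch, MAX_BATCH_SIZE - 1);   // dispatch immediately, bounded
>       inWriteTransaction(() -> batch.forEach(Work::apply));  // one lock per batch
>     }
>   }
> 
>   private void inWriteTransaction(Runnable work) { work.run(); }  // placeholder
> }
> ```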
> 
> #####Results
> The proposed approach has been heavily tested in production and delivered the
> best results. The lock contention latencies went down by 2x to 5x depending
> on the cluster load. A number of other approaches were tried but discarded as
> not performing well, or even performing much worse than the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` use clock-driven
> execution while others use max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #####Upcoming
> The introduction of the `BatchWorker` by itself was not enough to
> substantially improve the MTTA. It, however, paves the way for the next phase
> of scheduling perf improvement: taking more than one task from a given
> `TaskGroup` in a single scheduling round (coming soon). That improvement
> would not deliver without decreasing the lock contention first.
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in
> `BatchWorker` (e.g. `executeWithReplay`) appears to be unused in the current
> patch but will become relevant in part 2 (coming out shortly).
> 
> [1] - 
> https://github.com/apache/aurora/blob/f6ac13b169aaad5aad73ef3cc72873781e30a705/src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java#L540-L555
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0 
>   src/main/java/org/apache/aurora/scheduler/events/PubsubEvent.java 70b5470b9dad1af838b5222cae5ac86487e2f2e4 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java f07746c2b990c1c2235e99f9e4775fc84f9c27b1 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskThrottler.java bbd971a2aa8a96cf79edd879ad60e1bebd933d79 
>   src/main/java/org/apache/aurora/scheduler/state/MaintenanceController.java 3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 594bb6219294dcc77d48dcad14e2a6f9caa0c534 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 99c27e8012f10a67ce5f1b84d258e7a5608995c7 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskThrottlerTest.java 7d104aa2ea4a4d99be4711f666d18beca238284e 
>   src/test/java/org/apache/aurora/scheduler/state/MaintenanceControllerImplTest.java 94f5ca565476f62d72879837a0e7dafabcf30432 
>   src/test/java/org/apache/aurora/scheduler/testing/BatchWorkerUtil.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 196df4754b553f05e50b66ad2f84271901bc9eba 
> 
> Diff: https://reviews.apache.org/r/51759/diff/
> 
> 
> Testing
> -------
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>
