> On April 25, 2017, 3:01 a.m., David McLaughlin wrote:
> > We have completed initial scale testing of this patch, with updates
> > spanning 10 to 10k instances across 10k agents. Here are the findings:
> > 
> > 1) The patch works great for small and medium-sized updates.
> > 2) For large updates, things start off with a significant performance
> > improvement but eventually degrade, with cache hits falling to almost 0%
> > (at which point performance reverts to that of master).
> > 3) Initially we believed the offers were taking too long due to
> > compaction, but the overhead there turned out to be only a couple of
> > seconds.
> > 4) We believe we have root-caused the degrading cache hits to
> > interference from the task history pruner.
> > 5) Expanding the timeout to 2 minutes doesn't seem to help either; the
> > performance degradation due to (4) is quite severe.
> > 
> > See attached screenshots. 
> > 
> > Anecdotally, this explains an issue we've frequently witnessed where
> > extremely large services (5-8k instances) caused cluster-wide slowdown
> > even when capacity was readily available.
> > 
> > Next steps are to confirm and address the task history pruning issue.
> 
> David McLaughlin wrote:
>     Another update:
>     
>     After a lot of testing, we tracked this down to the scheduling
>     penalty in TaskGroups. Unfortunately there is a bug in the penalty
>     metric calculation (the counter isn't incremented when no tasks in a
>     batch manage to be scheduled), which led us to falsely rule this out
>     at first. After ruling out GC and the async workers, we revisited the
>     metric calculation and discovered the bug. From there, we were able
>     to tune various settings to improve cache hit performance. But there
>     are still cases where the cache hit % degrades to 0 and stays there
>     for large updates. The metric bug is sketched below.
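>     
>     To illustrate the bug pattern, here's a minimal sketch (illustrative
>     names only, not the actual TaskGroups code): the counter is guarded
>     by a condition that skips the fully-failed batch, which is exactly
>     the case the metric exists to expose.
>     
>         // Sketch only (illustrative names, not the real TaskGroups code).
>         void scheduleBatch(List<String> batch) {
>             int failed = 0;
>             for (String taskId : batch) {
>                 if (!trySchedule(taskId)) {
>                     failed++;
>                 }
>             }
>             if (failed < batch.size()) {
>                 // BUG: never reached when *nothing* in the batch schedules,
>                 // so the metric reads zero exactly when the penalty is at
>                 // its worst.
>                 schedulePenaltyCounter.addAndGet(failed);
>             }
>         }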
>     
>     Tuning is complicated because you have to consider different update
>     batch sizes vs. the number of concurrent updates vs. max schedule
>     attempts vs. tasks per group (and every other setting in
>     SchedulingModule, really). On top of all this, you also need to tune
>     carefully to avoid being adversely affected by chronically failing
>     and permanently pending tasks. The main knobs are shown below.
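>     
>     For concreteness, these are the kinds of scheduler flags involved
>     (names from SchedulingModule; the values are placeholders for
>     illustration, not recommendations):
>     
>         -max_tasks_per_schedule_attempt=25
>         -initial_schedule_penalty=1secs
>         -max_schedule_penalty=5secs
>         -max_schedule_attempts_per_sec=40.0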
>     
>     The goal is to make sure that tasks waiting for reservations to be
>     freed up aren't penalized too heavily, without repeating work for bad
>     actors.
>     
>     Probably the worst property is that once you start getting cache
>     misses, it's very hard to recover. This is because a cache miss falls
>     back to the regular scheduling algorithm, which can also fail to find
>     matching offers, and every failed attempt only adds to the delay.
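>     
>     The compounding comes from the backoff applied after each failed
>     attempt (Aurora uses TruncatedBinaryBackoff for this); a minimal
>     sketch of the doubling behavior:
>     
>         // Every consecutive scheduling failure doubles the penalty until
>         // it hits the configured ceiling, so a run of cache misses digs
>         // a hole that takes a long time to climb out of.
>         long nextPenaltyMs(long lastPenaltyMs, long initialMs, long maxMs) {
>             return lastPenaltyMs == 0
>                 ? initialMs
>                 : Math.min(lastPenaltyMs * 2, maxMs);
>         }
>         // e.g. initial=1s, max=60s: 1s, 2s, 4s, 8s, ..., 60s, 60s, ...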
>     
>     We could probably avoid most of these issues if we could somehow connect 
> the killing of tasks for updates into the currently scheduling throughput... 
> but that would require a huge refactor. 
>     
>     Currently we manage a 100% cache hit rate with a high number of
>     concurrent updates (~1k+ instances updated per minute) by lowering
>     the worst-case scheduling penalty and increasing the number of tasks
>     considered per job.
>     
>     It's also worth noting that we'd have seen the same behavior with
>     dynamic reservations that had 1-minute timeouts.
> 
> Stephan Erb wrote:
>     Thanks a lot for keeping us posted! 
>     
>     Three questions:
>     
>     a) Do your TaskGroups findings rule out the influence of the task
>     history pruner? Or do you already have a workaround for it? Looking
>     at the code, it seems to be quadratic in the number of terminated
>     tasks, so it could very likely affect scheduler throughput as well
>     (each terminated instance triggers an async function that looks at
>     all terminated instances), as sketched below.
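>     
>     (A sketch of that pattern with hypothetical names, not the actual
>     pruner code:)
>     
>         // Each terminal state change enqueues a pass over ALL terminated
>         // tasks, so N terminations cost O(N^2) scans in total.
>         void onTaskTerminated(String taskId) {
>             executor.execute(() -> {
>                 List<String> terminated = store.fetchTerminatedTasks();
>                 pruneTasksBeyondRetention(terminated);
>             });
>         }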
>     
>     b) What scheduling penalty and what number of tasks per scheduling 
> attempt did you end up using?
>     
>     c) Have you considered relaxing the affinity to be based on 
> `TaskGroupKey` rather than per `InstanceKey`? As this is an optimization 
> meant for scheduling throughput and not for persistent offers, we don't 
> really care which instance re-uses a slot. A scheduled instance might thus 
> use one of the `batch_size` reservations, even if the reservation of its 
> previous slot has long expired.

a) I believe so, although we should still improve the task history pruning 
algorithm. 
b) To get the results in my latest chart we had an update reservation hold 
time of 5 mins, a max penalty of 5 secs, and 25 tasks per job. Note that 
bumping the reservation hold time to 2 minutes alone, without changing the 
other settings, only delays the problem by a few minutes (because of the 
binary backoff associated with the pending penalty). To obtain high throughput 
without a steadily increasing MTTA, only reducing the max penalty and 
increasing the tasks per job helps. One idea I had was to avoid using the 
TruncatedBinaryBackoff when update reservations are present (this would 
require passing the UpdateAgentReserver to TaskGroups... I'm starting to worry 
about leaky abstractions there).
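
A rough sketch of that idea (hypothetical method names; nothing like 
hasActiveReservations exists in the patch today):

    // Inside TaskGroups' penalty calculation: skip the truncated binary
    // backoff for groups that are waiting on a reserved agent to free up.
    long penaltyMs(TaskGroupKey group, long lastPenaltyMs) {
        if (updateAgentReserver.hasActiveReservations(group)) {
            return initialPenaltyMs; // Don't punish tasks waiting on a hold.
        }
        return lastPenaltyMs == 0
            ? initialPenaltyMs
            : Math.min(lastPenaltyMs * 2, maxPenaltyMs); // Normal backoff.
    }
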
c) I started with TaskGroupKey, but had to switch to InstanceKey to support 
multiple tasks from the same job on the same host.
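
Concretely (a sketch of the shapes involved, not the real BiCache wiring): 
with TaskGroupKey, two instances of the same job landing on the same host are 
indistinguishable, so their reservations collide.

    // Keyed by TaskGroupKey: instances 1 and 2 of the same job share a key,
    // so one reservation clobbers the other when both target the same agent.
    Map<TaskGroupKey, String> byGroup;    // group -> agentId (ambiguous)

    // Keyed by InstanceKey: (job, instanceId) is unique, so each updating
    // instance holds its own reservation, even on the same host.
    Map<InstanceKey, String> byInstance;  // (job, instance) -> agentId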


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172889
-----------------------------------------------------------


On May 2, 2017, 1:32 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 1:32 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
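> 
> (A rough sketch of the shape of that wrapper; the method names here are 
> illustrative only, the real interface is UpdateAgentReserver.java in the 
> diff:)
> 
>     // Illustrative sketch, not the actual interface in this patch.
>     interface AgentReserverSketch {
>       // Optimistically hold the agent for the instance being updated.
>       void reserve(String agentId, IInstanceKey instance);
>       // Drop the hold once the replacement is assigned (or gives up).
>       void release(String agentId, IInstanceKey instance);
>       // Consulted during offer matching to veto offers on held agents.
>       Optional<IInstanceKey> getReservation(String agentId);
>     }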
> 
> 
> Note: I don't even bother checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not, we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare case that you increase resources. The matching logic is sketched 
> below.
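> 
> (A sketch of that offer-matching logic with illustrative names, not the 
> actual TaskAssigner change:)
> 
>     // An offer on a held agent is vetoed for everyone but the holder, and
>     // even the holder still runs the full resource/constraint check, since
>     // the reservation was made without checking resources.
>     boolean mayUseOffer(Offer offer, IAssignedTask task) {
>       Optional<IInstanceKey> holder = reserver.getReservation(offer.getAgentId());
>       if (holder.isPresent() && !holder.get().equals(instanceKeyOf(task))) {
>         return false; // Held for a different updating instance: veto.
>       }
>       return fitsResourcesAndConstraints(offer, task);
>     }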
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between KILLED and ASSIGNED 
> in the in-flight batch fall back to the current first-fit scheduling 
> algorithm.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 13cbdadad606d9acaadc541320b22b0ae538cc5e 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java fa1a81785802b82542030e1aae786fe9570d9827 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java cf2d25ec2e407df7159e0021ddb44adf937e1777 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java b2c4c66850dd8f35e06a631809530faa3b776252 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java c78c7fbd7d600586136863c99ce3d7387895efee 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 30b44f88a5b8477e917da21d92361aea1a39ceeb 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 833fd62c870f96b96343ee5e0eed0d439536381f 
>   src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58259/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> File Attachments
> ----------------
> 
> Cache utilization over time
>   
> https://reviews.apache.org/media/uploaded/files/2017/04/25/7b41bd2b-4151-482c-9de2-9dee67c34133__declining-cache-hits.png
> Offer rate from Mesos over time
>   
> https://reviews.apache.org/media/uploaded/files/2017/04/25/b107d964-ee7d-435a-a3d9-2b54f6eac3fa__consistent-offer-rate.png
> Async task workload (scaled) correlation with degraded cache utilization
>   
> https://reviews.apache.org/media/uploaded/files/2017/04/25/7eaf37ac-fbf3-40eb-b3f6-90e914a3936f__async-task-correlation.png
> Cache hit rate before and after scheduler tuning
>   
> https://reviews.apache.org/media/uploaded/files/2017/05/02/39998e8d-2a75-4f5d-bfc0-bb93011407af__Screen_Shot_2017-05-01_at_6.30.18_PM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
