> On April 25, 2017, 5:01 a.m., David McLaughlin wrote:
> > We have completed initial scale testing of this patch with updates spanning 10 to 10k instances across 10k agents. Here are the findings:
> > 
> > 1) The patch works great for small and medium sized updates.
> > 2) For large updates, things start with significant performance improvements but eventually degrade, with cache hits dropping to almost 0% (at which point performance reverts to that of master).
> > 3) Initially we believed the offers were taking too long due to compaction, but the overhead there turned out to be only a couple of seconds.
> > 4) We believe we have root-caused the degrading cache hits to interference from the task history pruner.
> > 5) Expanding the timeout to 2 minutes doesn't seem to help either; the performance degradation due to (4) is quite severe.
> > 
> > See attached screenshots.
> > 
> > Anecdotally, this explains an issue we've frequently witnessed where extremely large services (5-8k instances) caused cluster-wide slowdown even when capacity was readily available.
> > 
> > Next steps are to confirm and address the task history pruning issue.
> 
> David McLaughlin wrote:
>     Another update:
>     
>     After a lot of testing, we tracked this down to the scheduling penalty in TaskGroups. Unfortunately, there is a bug in the penalty metric calculation (the counter isn't incremented when no tasks in a batch manage to be scheduled), which meant we falsely ruled this out. After ruling out GC and the async workers, we revisited the metric calculation and discovered the bug. From there, we were able to tune various settings to improve cache hit performance. But there are still occasional cases where the cache hit % degrades to 0 and stays there for large updates.
>     
>     Tuning is complicated because you have to consider different update batch sizes vs. number of concurrent updates vs. max schedule attempts vs. tasks per group (and every other setting in SchedulingModule, really). On top of all of this, you also need to tune carefully to avoid being adversely affected by your chronically failing and permanently pending tasks.
>     
>     The goal is to make sure the tasks waiting for reservations to be freed up aren't punished too heavily, without also repeating work for bad actors.
>     
>     Probably the worst property is that once you start getting cache misses, it's very hard to recover. This is because a cache miss falls back to the regular scheduling algorithm, which can also fail to find matching offers, and this only adds to the delay.
>     
>     We could probably avoid most of these issues if we could somehow connect the killing of tasks for updates to the current scheduling throughput... but that would require a huge refactor.
>     
>     Currently we manage a 100% cache hit rate with a high number of concurrent updates (~1k+ instances updated per minute) by lowering the worst-case scheduling penalty and increasing the number of tasks considered per job.
>     
>     It's also worth noting that we would have seen the same behavior with dynamic reservations that had 1-minute timeouts.
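The metric bug mentioned above is worth spelling out, because it follows a common pattern: the counter only moves when a batch makes partial progress, so a batch in which no task schedules at all (the case that actually signals trouble) leaves the metric flat. A minimal sketch with hypothetical names, not the actual TaskGroups code:

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    // Minimal sketch of the penalty-metric bug described above.
    // All names are hypothetical; this is not the actual TaskGroups code.
    class PenaltyMetricSketch {
      private final AtomicLong penalizedBatches = new AtomicLong();

      interface TaskLauncher {
        boolean tryLaunch(String taskId);
      }

      int runBatch(List<String> taskIds, TaskLauncher launcher) {
        int scheduled = 0;
        for (String id : taskIds) {
          if (launcher.tryLaunch(id)) {
            scheduled++;
          }
        }
        // BUG: only partially scheduled batches are counted. When
        // scheduled == 0 the scheduler still applies a penalty, but the
        // metric stays flat -- which is how the penalty was falsely
        // ruled out. The fix is to also count the scheduled == 0 case.
        if (scheduled > 0 && scheduled < taskIds.size()) {
          penalizedBatches.incrementAndGet();
        }
        return scheduled;
      }
    }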
Thanks a lot for keeping us posted! Three questions:

a) Do your TaskGroups findings rule out the influence of the task history pruner, or do you already have a workaround for it? Looking at the code, it seems to be quadratic in the number of terminated tasks (each terminated instance triggers an async function that looks at all terminated instances), so it could very likely affect scheduler throughput as well.

b) What scheduling penalty and what number of tasks per scheduling attempt did you end up using?

c) Have you considered relaxing the affinity to be based on `TaskGroupKey` rather than per `InstanceKey`? As this is an optimization meant for scheduling throughput and not for persistent offers, we don't really care which instance re-uses a slot. A scheduled instance might thus use one of the `batch_size` reservations, even if the reservation of its previous slot has long expired. (A sketch of this idea follows at the end of this mail.)


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172889
-----------------------------------------------------------


On May 2, 2017, 3:32 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 3:32 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned that we could implement update affinity with less complexity using the same technique as preemption. Here is how that would work.
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and then optimistically tries to keep an agent free for a task during the update process.
> 
> Note: I don't even bother checking the resources before reserving the agent. I figure there is a chance the agent has enough room, and if not, we'll catch it when we attempt to veto the offer. We need to always check the offer like this anyway in case constraints change. In the worst case, it adds some delay in the rare case where you increase resources.
> 
> We also don't persist the reservations, so if the Scheduler fails over during an update, the worst case is that any instances in the in-flight batch between KILLED and ASSIGNED need to fall back to the current first-fit scheduling algorithm.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 13cbdadad606d9acaadc541320b22b0ae538cc5e 
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java fa1a81785802b82542030e1aae786fe9570d9827 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java cf2d25ec2e407df7159e0021ddb44adf937e1777 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java b2c4c66850dd8f35e06a631809530faa3b776252 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java c78c7fbd7d600586136863c99ce3d7387895efee 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 30b44f88a5b8477e917da21d92361aea1a39ceeb 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 833fd62c870f96b96343ee5e0eed0d439536381f 
>   src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58259/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> File Attachments
> ----------------
> 
> Cache utilization over time
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/7b41bd2b-4151-482c-9de2-9dee67c34133__declining-cache-hits.png
> Offer rate from Mesos over time
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/b107d964-ee7d-435a-a3d9-2b54f6eac3fa__consistent-offer-rate.png
> Async task workload (scaled) correlation with degraded cache utilization
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/7eaf37ac-fbf3-40eb-b3f6-90e914a3936f__async-task-correlation.png
> Cache hit rate before and after scheduler tuning
>   https://reviews.apache.org/media/uploaded/files/2017/05/02/39998e8d-2a75-4f5d-bfc0-bb93011407af__Screen_Shot_2017-05-01_at_6.30.18_PM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
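To make the quoted description concrete, here is a rough sketch of the reservation wrapper it describes, assuming a plain in-memory, best-effort mapping. Names are illustrative; the actual patch wraps the preemptor's BiCache, which additionally provides agent-to-key reverse lookups and timed expiry:

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // Rough sketch of the update-affinity wrapper described above. Names
    // are illustrative; the real patch wraps the preemptor's BiCache.
    class UpdateAgentReserverSketch<K> {
      // In-memory only, matching the review: reservations are not
      // persisted, so a scheduler failover simply loses them and the
      // affected instances fall back to first-fit scheduling.
      private final Map<K, String> reservations = new ConcurrentHashMap<>();

      // Called when an instance is killed for an update: remember the
      // agent it vacated. Resources are deliberately not checked here; a
      // stale or too-small reservation is caught later when the offer is
      // vetoed.
      void reserve(K taskKey, String agentId) {
        reservations.put(taskKey, agentId);
      }

      // Consulted before regular scheduling: prefer the reserved agent
      // if one is known for this task.
      Optional<String> getReservedAgent(K taskKey) {
        return Optional.ofNullable(reservations.get(taskKey));
      }

      // Called once the replacement instance is ASSIGNED (or gives up).
      void release(K taskKey) {
        reservations.remove(taskKey);
      }
    }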

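Along the same lines, a sketch of the relaxation suggested in question c): key the reservations by task group rather than per instance, so any instance of the group may claim any agent the group has freed. Again illustrative only; expiry and reverse lookups are omitted:

    import java.util.Map;
    import java.util.Optional;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Illustrative sketch of question c) above: reservations keyed per
    // task group rather than per instance.
    class GroupAffinityReserverSketch<G> {
      private final Map<G, Queue<String>> freedAgents = new ConcurrentHashMap<>();

      // Any killed instance of the group contributes its agent to the pool.
      void reserve(G group, String agentId) {
        freedAgents
            .computeIfAbsent(group, g -> new ConcurrentLinkedQueue<>())
            .add(agentId);
      }

      // Any scheduled instance of the group may claim whichever reserved
      // agent is still available -- we don't care which instance re-uses
      // which slot, only that scheduling throughput benefits.
      Optional<String> claim(G group) {
        Queue<String> agents = freedAgents.get(group);
        return agents == null
            ? Optional.empty()
            : Optional.ofNullable(agents.poll());
      }
    }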