-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/
-----------------------------------------------------------
(Updated May 2, 2017, 11:55 p.m.)


Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.


Changes
-------

After adding the missing penalty metric, we found the penalty did not account for the lost time. Another theory debunked! But thankfully we have root-caused this now.

We added debug logging and saw that there was a 5~10s delay between each batch even when the reported penalty was less than 10ms, yet the total time spent in TaskScheduler.schedule was under 2 seconds across the entire minute. So it wasn't a slow scheduling loop causing the delay. It turns out the cause is the sheer volume of update instance transitions in our scale test jobs. I've attached a screenshot showing that 59 seconds out of a minute were spent waiting on locks in TaskEventBatchWorker. Grepping through the Scheduler logs shows that all of this work is being done by JobUpdateControllerImpl. So it looks like our scale test was hitting the theoretical throughput limit of the Scheduler updater (and the update storage in particular): the update controller maxes out at around 2.5k update instance transitions per minute. To confirm there wasn't contention with other processes, we also disabled TaskHistoryPruner (which also uses TaskEventBatchWorker as an executor), re-ran the test, and got similar results.

The magic reservation hold time we found to maintain a 100% cache hit rate for our ~10k instances across 10 jobs is 3 minutes. I also believe that for a real job (i.e. not our fake scale test jobs) the time to download binaries, start up, etc. would add some delay between batches and reduce the amount of work JobUpdateControllerImpl has to do per minute - so you could probably tune your timeout based on the typical watch secs at your company. We'll do more experimenting with this and report findings.

One other thing I should mention: when the cache hit rate is 100% for our scale test, updating 10k instances across 10 jobs takes around 12 minutes every single time. When the cache hit rate starts to degrade, you're looking at around 40~60 minutes. So this is a significant improvement to MTTA across the cluster. It also leads to *way* less work done in the scheduling loop, which is good for GC pressure.

I'm going to move forward and add logging and metrics to this patch, and disable the feature by default since it requires such careful tuning to get right.


Repository: aurora


Description
-------

In the Dynamic Reservations review (and on the mailing list), I mentioned that we could implement update affinity with less complexity using the same technique as preemption. Here is how that would work.

This just adds a simple wrapper around the preemptor's BiCache structure and then optimistically tries to keep an agent free for a task during the update process (see the rough sketch after this description).

Note: I don't even bother checking the resources before reserving the agent. I figure there is a chance the agent has enough room, and if not we'll catch it when we attempt to veto the offer. We need to always check the offer like this anyway in case constraints change. In the worst case it adds some delay in the rare cases where you increase resources.

We also don't persist the reservations, so if the Scheduler fails over during an update, the worst case is that any instances in the in-flight batch that are between KILLED and ASSIGNED need to fall back to the current first-fit scheduling algorithm.
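For reviewers who want the shape of the idea without reading the diff, below is a minimal, hypothetical sketch. The names (AgentReservations, reserve, release, isReservedForSomeoneElse) and the plain map-with-expiry are illustrative stand-ins, not the UpdateAgentReserver/BiCache code in this patch; they only show the approach: when an instance is killed during an update, remember its agent, and veto offers from that agent for other tasks until the replacement is assigned or the hold time expires.

    // Hypothetical sketch only; the real patch wraps the preemptor's BiCache.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class AgentReservations {
      private final long holdMillis;
      // instance key -> (agent id, time the reservation was taken)
      private final Map<String, Reservation> reservations = new ConcurrentHashMap<>();

      AgentReservations(long holdMillis) {
        this.holdMillis = holdMillis;
      }

      /** Called when the updater kills an instance: hold its agent for the replacement. */
      void reserve(String agentId, String instanceKey) {
        reservations.put(instanceKey, new Reservation(agentId, System.currentTimeMillis()));
      }

      /** Called when the replacement instance is assigned (or the update finishes). */
      void release(String instanceKey) {
        reservations.remove(instanceKey);
      }

      /** The scheduling loop asks whether an offer's agent is held for a different task. */
      boolean isReservedForSomeoneElse(String agentId, String instanceKey) {
        long now = System.currentTimeMillis();
        // Expire stale holds so a lost replacement can't pin an agent forever.
        reservations.values().removeIf(r -> now - r.timestampMillis > holdMillis);
        return reservations.entrySet().stream()
            .anyMatch(e -> e.getValue().agentId.equals(agentId)
                && !e.getKey().equals(instanceKey));
      }

      private static final class Reservation {
        final String agentId;
        final long timestampMillis;

        Reservation(String agentId, long timestampMillis) {
          this.agentId = agentId;
          this.timestampMillis = timestampMillis;
        }
      }
    }

Since the holds live only in memory (matching the patch's decision not to persist reservations), a Scheduler failover simply drops them and the affected instances fall back to first-fit scheduling.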
Diffs
-----

  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 203f62bacc47470545d095e4d25f7e0f25990ed9
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java a177b301203143539b052524d14043ec8a85a46d
  src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java b4cd01b3e03029157d5ca5d1d8e79f01296b57c2
  src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java f25dc0c6d9c05833b9938b023669c9c36a489f68
  src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java c129896d8cd54abd2634e2a339c27921042b0162
  src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java e14112479807b4477b82554caf84fe733f62cf58
  src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java c95943d242dc2f539778bdc9e071f342005e8de3
  src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java PRE-CREATION
  src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 13cbdadad606d9acaadc541320b22b0ae538cc5e
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java fa1a81785802b82542030e1aae786fe9570d9827
  src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java cf2d25ec2e407df7159e0021ddb44adf937e1777
  src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java b2c4c66850dd8f35e06a631809530faa3b776252
  src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java c78c7fbd7d600586136863c99ce3d7387895efee
  src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 30b44f88a5b8477e917da21d92361aea1a39ceeb
  src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 833fd62c870f96b96343ee5e0eed0d439536381f
  src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java PRE-CREATION
  src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/58259/diff/2/


Testing
-------

./gradlew build
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


File Attachments (updated)
----------------

Cache utilization over time
  https://reviews.apache.org/media/uploaded/files/2017/04/25/7b41bd2b-4151-482c-9de2-9dee67c34133__declining-cache-hits.png
Offer rate from Mesos over time
  https://reviews.apache.org/media/uploaded/files/2017/04/25/b107d964-ee7d-435a-a3d9-2b54f6eac3fa__consistent-offer-rate.png
Async task workload (scaled) correlation with degraded cache utilization
  https://reviews.apache.org/media/uploaded/files/2017/04/25/7eaf37ac-fbf3-40eb-b3f6-90e914a3936f__async-task-correlation.png
Cache hit rate before and after scheduler tuning
  https://reviews.apache.org/media/uploaded/files/2017/05/02/39998e8d-2a75-4f5d-bfc0-bb93011407af__Screen_Shot_2017-05-01_at_6.30.18_PM.png
JobUpdateControllerImpl bottleneck
  https://reviews.apache.org/media/uploaded/files/2017/05/02/f93484bd-c99e-4c01-9f8a-f0ad867adb26__Screen_Shot_2017-05-02_at_3.33.39_PM.png


Thanks,

David McLaughlin
