> On April 25, 2017, 3:01 a.m., David McLaughlin wrote:
> > We have completed initial scale testing of this patch with updates spanning 10 to 10k instances across 10k agents. Here are the findings:
> >
> > 1) The patch works great for small and medium-sized updates.
> > 2) For large updates, things start with a significant performance improvement but eventually degrade, with cache hits falling to almost 0% (at which point performance reverts to what we see on master).
> > 3) Initially we believed the offers were taking too long due to compaction, but the overhead there turned out to be only a couple of seconds.
> > 4) We believe we have root-caused the degrading cache hits to interference from the task history pruner.
> > 5) Expanding the timeout to 2 minutes doesn't seem to help either; the performance degradation due to (4) is quite severe.
> >
> > See attached screenshots.
> >
> > Anecdotally, this explains an issue we've frequently witnessed where extremely large services (5-8k instances) caused cluster-wide slowdown even when capacity was readily available.
> >
> > Next steps are to confirm and address the task history pruning issue.
> 
> David McLaughlin wrote:
>     Another update:
>     
>     After (a lot of) testing, we tracked this down to the scheduling penalty in TaskGroups. Unfortunately there is a bug in the penalty metric calculation (the counter isn't incremented when no tasks in a batch manage to be scheduled), which meant we falsely ruled this out. After ruling out GC and the async workers, we revisited the metric calculation and discovered the bug. From there, we were able to tune various settings to improve cache hit performance. But there are still cases where the cache hit % degrades to 0 and stays there for large updates.
>     
>     Tuning is complicated because you have to consider update batch size vs. number of concurrent updates vs. max schedule attempts vs. tasks per group (and every other setting in SchedulingModule, really). On top of all of this, you also need to tune carefully to avoid being adversely affected by your chronically failing and permanently pending tasks.
>     
>     The goal is to make sure the tasks waiting for reservations to be freed up aren't punished too heavily, without also repeating work for bad actors.
>     
>     Probably the worst property is that once you start getting cache misses, it's very hard to recover - a cache miss falls back to the regular scheduling algorithm, which can also fail to find matching offers, and this only adds to the delay.
>     
>     We could probably avoid most of these issues if we could somehow connect the killing of tasks for updates to the current scheduling throughput... but that would require a huge refactor.
>     
>     Currently we manage a 100% cache hit rate with a high number of concurrent updates (~1k+ instances updated per minute) by lowering the worst-case scheduling penalty and increasing the number of tasks considered per job.
>     
>     It's also worth noting that we would have seen the same behavior with dynamic reservations that had 1-minute timeouts.
> 
> Stephan Erb wrote:
>     Thanks a lot for keeping us posted!
>     
>     Three questions:
>     
>     a) Do your TaskGroups findings rule out the influence of the task history pruner? Or do you already have a workaround for it? Looking at the code, it seems to be quadratic in the number of terminated tasks, so it could very likely affect scheduler throughput as well (each terminated instance triggers an async function that looks at all terminated instances).
>     
>     b) What scheduling penalty and what number of tasks per scheduling attempt did you end up using?
>     
>     c) Have you considered relaxing the affinity to be based on `TaskGroupKey` rather than per `InstanceKey`? As this is an optimization meant for scheduling throughput and not for persistent offers, we don't really care which instance re-uses a slot. A scheduled instance might thus use one of the `batch_size` reservations, even if the reservation of its previous slot has long expired.
a) I believe so, although we should still improve the task history pruning algorithm.

b) To get the results in my latest chart, we used an update reservation hold time of 5 minutes, a max penalty of 5 seconds, and 25 tasks per job. Note that bumping the reservation hold time to 2 minutes alone, without changing the other settings, only delays the problem by a few minutes (because of the binary backoff associated with the pending penalty). To obtain high throughput without a steadily increasing MTTA, only reducing the max penalty and increasing the tasks per job helps. One idea I had was to avoid using the TruncatedBinaryBackoff when update reservations are present (this would require passing the UpdateAgentReserver to TaskGroups... starting to worry about leaky abstractions there).

c) I started with TaskGroupKey initially, but had to switch to InstanceKey to support multiple tasks from the same job on the same host.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172889
-----------------------------------------------------------


On May 2, 2017, 1:32 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> -----------------------------------------------------------
> 
> (Updated May 2, 2017, 1:32 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned that we could implement update affinity with less complexity using the same technique as preemption. Here is how that would work.
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and then optimistically tries to keep an agent free for a task during the update process.
> 
> Note: I don't even bother checking the resources before reserving the agent. I figure there is a chance the agent has enough room, and if not we'll catch it when we attempt to veto the offer. We always need to check the offer like this anyway in case constraints change. In the worst case it adds some delay in the rare cases where you increase resources.
> 
> We also don't persist the reservations, so if the Scheduler fails over during an update, the worst case is that any instances between KILLED and ASSIGNED in the in-flight batch need to fall back to the current first-fit scheduling algorithm.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 203f62bacc47470545d095e4d25f7e0f25990ed9
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java a177b301203143539b052524d14043ec8a85a46d
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java b4cd01b3e03029157d5ca5d1d8e79f01296b57c2
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java f25dc0c6d9c05833b9938b023669c9c36a489f68
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java c129896d8cd54abd2634e2a339c27921042b0162
>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java e14112479807b4477b82554caf84fe733f62cf58
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java c95943d242dc2f539778bdc9e071f342005e8de3
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java PRE-CREATION
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 13cbdadad606d9acaadc541320b22b0ae538cc5e
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java fa1a81785802b82542030e1aae786fe9570d9827
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java cf2d25ec2e407df7159e0021ddb44adf937e1777
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java b2c4c66850dd8f35e06a631809530faa3b776252
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java c78c7fbd7d600586136863c99ce3d7387895efee
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 30b44f88a5b8477e917da21d92361aea1a39ceeb
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 833fd62c870f96b96343ee5e0eed0d439536381f
>   src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java PRE-CREATION
>   src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java PRE-CREATION
> 
> 
> Diff: https://reviews.apache.org/r/58259/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> File Attachments
> ----------------
> 
> Cache utilization over time
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/7b41bd2b-4151-482c-9de2-9dee67c34133__declining-cache-hits.png
> Offer rate from Mesos over time
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/b107d964-ee7d-435a-a3d9-2b54f6eac3fa__consistent-offer-rate.png
> Async task workload (scaled) correlation with degraded cache utilization
>   https://reviews.apache.org/media/uploaded/files/2017/04/25/7eaf37ac-fbf3-40eb-b3f6-90e914a3936f__async-task-correlation.png
> Cache hit rate before and after scheduler tuning
>   https://reviews.apache.org/media/uploaded/files/2017/05/02/39998e8d-2a75-4f5d-bfc0-bb93011407af__Screen_Shot_2017-05-01_at_6.30.18_PM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
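
[Editor's sketch] The quoted description proposes keeping an agent free for an instance while its old task is killed, backed by the preemptor's BiCache. Below is a minimal, self-contained illustration of that flow; class and method names are invented for clarity and this is not the actual patch, which wraps BiCache so reservations expire after the configured hold time.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy illustration of update affinity: while an instance is being updated,
 * remember which agent its old task ran on and prefer that agent for the
 * replacement task. The real patch wraps the preemptor's BiCache so entries
 * expire after the reservation hold time; this sketch uses a plain map and
 * omits expiry, persistence, and resource checks (the description deliberately
 * skips the resource check and relies on offer vetoing instead).
 */
final class SimpleUpdateAgentReserver {
  // Keyed per instance (not per task group) so several instances of the same
  // job on one host each keep their own slot, as noted in answer (c).
  private final Map<String, String> agentByInstance = new ConcurrentHashMap<>();

  /** Called when the updater kills the old task: pin its agent for this instance. */
  void reserve(String instanceKey, String agentId) {
    agentByInstance.put(instanceKey, agentId);
  }

  /** Called once the replacement task is assigned, or the update gives up. */
  void release(String instanceKey) {
    agentByInstance.remove(instanceKey);
  }

  /** Consulted at assignment time: which agent, if any, is held for this instance? */
  Optional<String> getReservedAgent(String instanceKey) {
    return Optional.ofNullable(agentByInstance.get(instanceKey));
  }

  /** Consulted when vetoing offers: is this agent held for some other instance? */
  boolean isReservedForOther(String agentId, String instanceKey) {
    return agentByInstance.entrySet().stream()
        .anyMatch(e -> e.getValue().equals(agentId) && !e.getKey().equals(instanceKey));
  }
}
```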

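[Editor's sketch] Answer (b) above floats the idea of skipping the TruncatedBinaryBackoff penalty for tasks that still hold an update reservation, so they retry quickly instead of backing off like chronically pending tasks. The following hypothetical sketch reuses the toy reserver above; the wiring and names are illustrative only and do not reflect the actual TaskGroups code.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical penalty calculation: instances with a live update reservation
 * are retried at the minimum penalty, while chronically failing or permanently
 * pending tasks keep the usual truncated binary backoff.
 */
final class ReservationAwarePenalty {
  private static final long INITIAL_PENALTY_MS = TimeUnit.SECONDS.toMillis(1);
  private static final long MAX_PENALTY_MS = TimeUnit.SECONDS.toMillis(5); // "max penalty of 5 secs"

  private final SimpleUpdateAgentReserver reserver;

  ReservationAwarePenalty(SimpleUpdateAgentReserver reserver) {
    this.reserver = reserver;
  }

  /** Truncated binary backoff: double the previous penalty, capped at the maximum. */
  private static long truncatedBinaryBackoffMs(long lastPenaltyMs) {
    return lastPenaltyMs == 0
        ? INITIAL_PENALTY_MS
        : Math.min(lastPenaltyMs * 2, MAX_PENALTY_MS);
  }

  /**
   * Penalty to apply after a failed scheduling round. An instance with a live
   * reservation gets the minimum penalty so it can grab its reserved agent as
   * soon as that agent's offer returns.
   */
  long nextPenaltyMs(String instanceKey, long lastPenaltyMs) {
    return reserver.getReservedAgent(instanceKey).isPresent()
        ? INITIAL_PENALTY_MS
        : truncatedBinaryBackoffMs(lastPenaltyMs);
  }
}
```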