> On April 17, 2017, 8:27 p.m., Stephan Erb wrote:
> > The code change looks decent to me. 
> > 
> > However, I am unsure about two things:
> > 
> > * For us it is common to have jobs with #instance in the ballpark of 
> > #agents. The proposed code change could easily block a significant number 
> > of agents for scheduling, even if there would be enough capacity for other 
> > job instances. So while improving the MTTA for job updates, this could 
> > easily lead to increased MTTA for regularly launched jobs (cron, adhoc, 
> > etc).
> > * The alternative dynamic reservation proposal has the advantage that it 
> > works when multiple frameworks are used. Would it be plausible to just 
> > reserve any used resources in a generic fashion, so that we ensure 
> > reservations always come back to Aurora and cannot be intercepted by 
> > another framework?
> > 
> > Please run `./gradlew jmh -Pbenchmarks='SchedulingBenchmarks.*'` to help 
> > ensure the scheduling changes don't come with an unexpected performance 
> > regression.
> 
> David McLaughlin wrote:
>     I think for (1) you described a problem that wouldn't be an issue in 
> clusters with decent amounts of spare capacity. It's only really an issue 
> in low-capacity clusters. And this change is specifically targeting the use 
> case you mentioned (a big, hard-to-schedule task of an important production 
> job being killed for an update, some low-priority task like a cron taking 
> its place, and the prod job then being unable to be scheduled, triggering 
> preemption and churn across the cluster - rinse and repeat for thousands of 
> instances of a task). 
>     
>     We run Aurora as a single framework, so can't really speak to (2). I 
> think though you'd just want Dynamic Reservations for this? Is that what 
> you're suggesting? Now we're back to the other approach which also has a 
> bunch of open questions.
>     
>     To be clear - this approach has one major difference I care about: it 
> does not expose this to users via a new tier. In practice it means we don't 
> need to ask people to opt in to what is essentially caching, and we also 
> don't need to expose the reserved tier for users (Twitter also has the use 
> case where we want to expose user-managed dynamic reservations via some 
> reserved tier).
> 
> Stephan Erb wrote:
>     (1) Yeah, good point. We will probably have to see how this behaves in 
> practice on smaller clusters. I have also realized that the batch size, 
> rather than #instances, is the limiting factor.
>     
>     (2) What I was aiming at is probably orthogonal to the implementation 
> itself: In a multi-framework world, neither the preemptor nor this affinity 
> patch will work nicely. Aurora will release resources and expect them to 
> come back; they probably never will. 
>     
>     The question is though: Rather than going to the trouble of conditionally 
> reserving resources using a tier setting, would it be feasible to 
> unconditionally reserve all resources Aurora uses? That way we could 
> guarantee they always bounce back. If we don't need them any longer, we could 
> unreserve them. As stated, this could be independent of the patch here, as it 
> would also apply to preemption.

Yeah, my understanding is that Aurora isn't the best neighbor in a 
multi-framework environment. 

I don't have deep knowledge here, but even with role quotas at the Mesos layer 
the problem is still there right? Because Mesos just shares the resources but 
doesn't partition agents in any way? 

Making all of our tasks run in dynamic reservations is something we discussed 
here. My gut reaction is that the main downside is the added complexity of the 
reconciliation layer, which would probably become crucial to performance - 
particularly if you're doing this for adhoc jobs, crons, etc. I'd also have 
concerns about how it plays with revocable and preemptible resources 
(specifically, when you need to kill multiple preemptible tasks on a box to 
make way for prod, we'd need an atomic "kill and merge these offers", and I'm 
not sure that would work in a multi-framework setup anyway). 

When I thought about this problem I saw two directions for the existing patches:

1) This is a patch to add a simple, best-effort caching layer to job updates to 
avoid repeating scheduling logic. The system still needs to be performant if 
there is a cache miss or the cache is flushed for any reason. You can disable 
the cache if you're getting a lot of cache misses because of multi-frameworks, 
or you can leave it enabled and tune the cache to reduce GC and deploy times 
for your customers. 
2) There is a separate proposal to add a reserved tier to Aurora that uses 
Dynamic Reservations to give much firmer guarantees that offers are held for 
certain tiers of jobs. This has a huge number of potential uses. We'd probably 
need to incorporate the new partition-aware status updates from Mesos and have 
a much deeper discussion about what controls users should have over their 
reservations. 

But most importantly, they are not mutually exclusive. Perhaps we can move 
forward with this but disable it by default? Then it's something you just turn 
on and tune if you're having performance issues. And we evaluate the Dynamic 
Reservations patch purely on how we want to expose a reserved tier to users.
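For what it's worth, the best-effort cache described in (1) can be sketched as 
a small TTL map from agent to task key. This is a hypothetical illustration, 
not the patch's actual BiCache wrapper; all class and method names here are 
made up:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of a best-effort update-affinity cache: when an
 * instance is killed for an update, its agent is "reserved" for a bounded
 * time so the replacement task gets first shot at that agent. A miss
 * (expired or absent entry) simply falls back to first-fit scheduling.
 */
final class UpdateAgentCache {
  private record Reservation(String taskKey, long expiresAtMillis) {}

  private final Map<String, Reservation> byAgent = new ConcurrentHashMap<>();
  private final long ttlMillis;

  UpdateAgentCache(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  /** Reserve an agent for the task about to be killed there. */
  void reserve(String agentId, String taskKey) {
    byAgent.put(agentId,
        new Reservation(taskKey, System.currentTimeMillis() + ttlMillis));
  }

  /** True if this agent's offer is held for a different task (so veto it). */
  boolean isReservedForOther(String agentId, String taskKey) {
    Reservation r = byAgent.get(agentId);
    if (r == null) {
      return false;
    }
    if (r.expiresAtMillis() < System.currentTimeMillis()) {
      byAgent.remove(agentId);  // Expired: the agent is free again.
      return false;
    }
    return !r.taskKey().equals(taskKey);
  }

  /** Release the reservation once the replacement task is assigned. */
  void release(String agentId) {
    byAgent.remove(agentId);
  }
}
```

Since nothing is persisted, flushing the map (failover, disabling the flag) 
only costs the affinity hint, never correctness.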


> On April 17, 2017, 8:27 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java
> > Lines 52-55 (patched)
> > <https://reviews.apache.org/r/58259/diff/2/?file=1689854#file1689854line52>
> >
> >     I am trying to understand if this is a good default for this 
> > best-effort feature.
> >     
> >     What is your cluster-wide MTTA? It should give us a decent hint for a 
> > suitable default.
> 
> David McLaughlin wrote:
>     Our MTTA can range from a couple milliseconds to several minutes. Depends 
> how many tasks are pending and how full the cluster is.
> 
> Stephan Erb wrote:
>     If I understand this correctly, this patch will help the "good case" but 
> could fall down quickly under overload: If the cluster is getting overloaded 
> with pending tasks, the 1 min timeout might not be sufficient to place a job 
> in its reserved spot. This will then lead to preemptions that further 
> aggravate the overload situation. 
>     
>     We will need a counter to track those expired reservations.

Yup, definitely need more metrics here. If the community gives the overall 
approach a +1, I will move forward with making this production ready.
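The counter Stephan asked for could look something like the following (a 
sketch only; the class and stat names are hypothetical, not from the patch):

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch: count reservations whose TTL elapsed before the
 * replacement task was placed, so operators can spot overload situations
 * where the affinity timeout is too short.
 */
final class ReservationStats {
  private final AtomicLong expiredReservations = new AtomicLong();

  /** Called when a reservation times out with its slot still unfilled. */
  void recordExpired() {
    expiredReservations.incrementAndGet();
  }

  /** Exported under a stat name like "update_agent_reservations_expired". */
  long expiredCount() {
    return expiredReservations.get();
  }
}
```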


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172122
-----------------------------------------------------------


On April 12, 2017, 7:51 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> -----------------------------------------------------------
> 
> (Updated April 12, 2017, 7:51 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
> 
> 
> Note: I don't bother even checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare cases you increase resources. 
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between the KILLED and 
> ASSIGNED in-flight batch need to fall back to the current first-fit 
> scheduling algorithm.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java 
> b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 
> 13cbdadad606d9acaadc541320b22b0ae538cc5e 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  fa1a81785802b82542030e1aae786fe9570d9827 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 
> cf2d25ec2e407df7159e0021ddb44adf937e1777 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> b2c4c66850dd8f35e06a631809530faa3b776252 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> c78c7fbd7d600586136863c99ce3d7387895efee 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 30b44f88a5b8477e917da21d92361aea1a39ceeb 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> 833fd62c870f96b96343ee5e0eed0d439536381f 
>   
> src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java 
> PRE-CREATION 
>   
> src/test/java/org/apache/aurora/scheduler/updater/UpdateAgentReserverImplTest.java
>  PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/58259/diff/2/
> 
> 
> Testing
> -------
> 
> ./gradlew build
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
