> On May 26, 2017, 8:23 a.m., Stephan Erb wrote:
> > I have thought about this in the past as well, and I am slightly skeptical 
> > if a simple sort by resources is a good approach when using 
> > oversubscription. Especially given the goal to run almost all 
> > (non-production) tasks as revocable.
> > 
> > Offers do contain both revocable and non-revocable resources, so we would 
> > need to find an ordering that is suitable for both:
> > 
> > * Sort by non-revocable resources first: We will pack the production tasks 
> > tightly but those are now pretty simple to schedule anyway, as most stuff 
> > is running as revocable. As the corresponding downside we don't reduce 
> > fragmentation for revocable resources.
> > 
> > * Sorting by revocable resources: offers with the fewest revocable 
> > resources will be at the front. Given the nature of the oversubscription, 
> > these will be the nodes with the highest load (as those are capped to 0 
> > revocable resources). If we schedule further non-revocable tasks on there, 
> > we will either get bad performance or trigger the termination of revocable 
> > tasks (depending on our `Estimator` and `QoSController`).
> > 
> > 
> > Alternative first-fit solutions that come to mind (partially inspired by 
> > http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf)
> > 
> > * Sort by agent id. Over time, this should converge to the same cluster 
> > state as sorting offers by size. The upside is 
> >   that it would work for revocable and non-revocable resources 
> > simultaneously.
> > 
> > * Compute a score per offer and then schedule on the node with the 
> > smallest offer. As an approximation
> >   we could use something like `10 ** (free_ram/total_ram) + 10 ** 
> > (free_revocable_ram/total_revocable_ram) ...`. 
> > 
> > * Extend the sort by revocable resources with a special case for 0 
> > remaining revocable resources.

I guess my main question here is: why couldn't your alternative orderings just 
be submitted as patches?


Sorting offers is definitely not a panacea and I didn't intend to suggest that. 
Please see my reply to Santhosh too. I will clean up the README/docs if we 
decide to ship this.  


To clarify further: 

What this patch is trying to enable is the ability to influence the global 
ordering so that it is better than random. It's basically a chance to say "this 
is what I would prefer to happen for _all_ tasks in the cluster, absent any 
other information". E.g. at Twitter, CPU is by far our scarcest resource, and 
any bin-packing solution would be based almost solely on that. 
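
To make that concrete, a global "CPU first" preference is essentially just a 
comparator over the free resources in each offer. A minimal sketch (the names 
here are illustrative, not the actual OfferOrder code):

```java
import java.util.Comparator;
import java.util.List;

class CpuFirstOrderSketch {
  // Illustrative stand-in for an offer's unreserved resources.
  record Offer(String agentId, double cpus, double ramMb) {}

  // Offers with the least free CPU come first, so a first-fit pass lands on
  // the tightest slot that still satisfies the task.
  static final Comparator<Offer> CPU_ASC =
      Comparator.comparingDouble(Offer::cpus).thenComparingDouble(Offer::ramMb);

  static void sortForAssignment(List<Offer> offers) {
    offers.sort(CPU_ASC);
  }
}
```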

But what you really want is a score-based approach that makes smart decisions 
_per-task_. This is sort of what constraints are, except that for bin-packing 
we don't want to avoid scheduling a task when ideal circumstances can't be 
found; we want to make sure we pick the best of a bad bunch. The problem is 
performance. 

Our steady state at Twitter is about 300 calls to TaskAssigner.maybeAssign per 
minute - that's crons, failing tasks, and other recurring scheduling load. When 
engineers start deploying, we see that number increase to around 1-2k per 
minute. 

The current algorithm has a worst-case complexity of O(o·n), where o = number 
of offers and n = number of assign attempts (let's assume the filtering done 
per offer is a constant cost). The worst case is hit when none of the offers 
match a task, which isn't all that uncommon. With 2k attempts against 40k 
offers, that's on the order of 80 million per-offer evaluations per minute, 
which is already getting close to being a scheduling bottleneck for us, e.g. 
http://i.imgur.com/C9O8BN8.png.
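
For reference, a stripped-down sketch of the first-fit shape (not the real 
TaskAssigner code; the match check stands in for all the per-offer filtering):

```java
import java.util.List;
import java.util.Optional;

class FirstFitSketch {
  record Offer(String agentId, double cpus) {}

  // Each assignment attempt walks the offer list until something matches. If
  // nothing matches, that's a full scan: o checks per attempt, so roughly
  // o * n checks overall. At 2k attempts against 40k offers, that is on the
  // order of 80 million per-offer evaluations per minute.
  static Optional<Offer> firstFit(List<Offer> offers, double cpusNeeded) {
    for (Offer offer : offers) {
      if (offer.cpus() >= cpusNeeded) {  // stand-in for veto/constraint filtering
        return Optional.of(offer);
      }
    }
    return Optional.empty();
  }
}
```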


A score-based approach that ranks all offers based on what is ideal for the 
current task is much more complex. You need to build a sorted structure of all 
offers for each maybeAssign attempt, so you now have something like an 
O(o·log(o)·n) algorithm. You can optimize n by increasing the batch size, etc., 
but there's an upper bound on this (and a bunch of other negative trade-offs). 
So the only other approach is to limit o, either by maintaining smaller 
clusters (which is recommended in the Borg paper) or by sampling, where you 
score a fixed-size sample of offers rather than the entire set. 
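
A sampling variant would look something like the sketch below (illustrative 
only; the leftover-CPU score is a placeholder, not a proposed scoring function):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class SamplingSketch {
  record Offer(String agentId, double cpus) {}

  // Score a fixed-size random sample of offers instead of the full set, and
  // take the best-scoring offer that fits. Per-task cost is bounded by the
  // sample size rather than by the cluster size.
  static Optional<Offer> pickFromSample(List<Offer> offers, double cpusNeeded, int sampleSize) {
    List<Offer> shuffled = new ArrayList<>(offers);
    Collections.shuffle(shuffled);
    return shuffled.stream()
        .limit(sampleSize)
        .filter(o -> o.cpus() >= cpusNeeded)
        // Placeholder score: prefer the offer with the least leftover CPU.
        .min(Comparator.comparingDouble(o -> o.cpus() - cpusNeeded));
  }
}
```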

We're still considering both approaches to reducing the number of offers. But 
with the latter approach of scoring only a subset of matching offers, what 
we've run into is that with a random ordering of all offers (e.g. 
OfferManager.getOffers on master), the improvement in bin-packing from scoring 
vs. plain first-fit or random assignment is negligible unless we blow up 
performance by increasing the sample size. One of the big problems is that as 
soon as a task is scheduled, the previous filtering decisions on all offers are 
potentially invalidated by things like host and limit constraints. But when I 
added a CPU sort on the offers, the results improved dramatically - only a 
couple of percentage points away from a perfect bin-packing simulation. So 
that's the main motivation for this patch. 
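
Roughly, what the patch boils down to is keeping the offers in a sorted 
structure so that the existing first-fit scan behaves like an approximate 
best-fit. A minimal sketch of that shape (illustrative names, not the actual 
OfferManager code):

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

class SortedOffersSketch {
  record Offer(String agentId, double cpus) {}

  // Offers stay sorted by free CPU (ascending); the agent id tie-break keeps
  // distinct offers with equal CPU from being collapsed by the set.
  private final ConcurrentSkipListSet<Offer> offers = new ConcurrentSkipListSet<>(
      Comparator.comparingDouble(Offer::cpus).thenComparing(Offer::agentId));

  void add(Offer offer) {
    offers.add(offer);
  }

  // Iterating in sorted order means the first offer that fits is also the
  // smallest one that fits.
  Optional<Offer> firstFit(double cpusNeeded) {
    for (Offer offer : offers) {
      if (offer.cpus() >= cpusNeeded) {
        return Optional.of(offer);
      }
    }
    return Optional.empty();
  }
}
```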

Unfortunately, we're stuck in the short term with this single-writer, 
single-reader architecture for Aurora, so we can't even fan out this workload 
by increasing the number of replicas, etc.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the 
> existing first-fit task assigner, and it can be controlled with a simple 
> command line argument. 
> 
> The bin-packing is only an approximation, but it can lead to pretty 
> significant improvements in resource utilization per agent. For example, on a 
> CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs), we were 
> able to reduce the share of hosts with tasks scheduled on them to just 90%, 
> down from the 99.7% you get with random placement. So if you are running 
> Aurora on elastic compute and paying for machines by the minute/hour, 
> utilizing this patch _could_ allow you to reduce your server footprint by as 
> much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance 
> of achieving perfect bin-packing if you put each task in the smallest slot 
> available. So if you have a task needing 8 cores and you have an 8-core and a 
> 12-core offer available, you'd always want to put the task in the 8-core 
> offer*. By sorting offers in OfferManager during iteration, a first-fit 
> algorithm is guaranteed to match the smallest possible offer for your task, 
> which achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the 
> other resources available, and more satisfactory results may also need 
> preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java 
> adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 
> 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 
> 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, 
> adding only a few extra seconds total in TaskAssigner when applied to 
> thousands of tasks per minute. 
> 
> There is some overhead when scheduling tasks that have large resource 
> requirements, as the task assigner will first need to skip over all the 
> offers with too few resources. In a packed cluster, this is where the extra 
> seconds are spent. This could be reduced by jumping over all the offers we 
> know to be too small, but that decision has to map to the OfferOrder (which 
> adds complexity). That can be addressed in a follow-up review if needed.
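> 
> A rough sketch of that follow-up idea (illustrative names only, not the 
> actual OfferManager code), assuming the offers already live in a sorted set 
> ordered by free CPU, so a synthetic probe can jump straight to the first 
> offer that is big enough:
> 
> ```java
> import java.util.Comparator;
> import java.util.Optional;
> import java.util.TreeSet;
> 
> class SkipSmallOffersSketch {
>   record Offer(String agentId, double cpus) {}
> 
>   private final TreeSet<Offer> byCpu = new TreeSet<>(
>       Comparator.comparingDouble(Offer::cpus).thenComparing(Offer::agentId));
> 
>   void add(Offer offer) {
>     byCpu.add(offer);
>   }
> 
>   // Skip every offer we already know is too small. This only works when the
>   // skip key matches the OfferOrder the set is sorted by.
>   Optional<Offer> firstFitSkippingSmall(double cpusNeeded) {
>     Offer probe = new Offer("", cpusNeeded);
>     return Optional.ofNullable(byCpu.ceiling(probe));
>   }
> }
> ```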
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
