> On May 26, 2017, 8:23 a.m., Stephan Erb wrote:
> > I have thought about this in the past as well, and I am slightly skeptical if a simple sort by resources is a good approach when using oversubscription, especially given the goal to run almost all (non-production) tasks as revocable.
> > 
> > Offers do contain both revocable and non-revocable resources, so we would need to find an ordering that is suitable for both:
> > 
> > * Sort by non-revocable resources first: We will pack the production tasks tightly, but those are now pretty simple to schedule anyway, as most stuff is running as revocable. The corresponding downside is that we don't reduce fragmentation for revocable resources.
> > * Sort by revocable resources: offers with the fewest revocable resources will be at the front. Given the nature of oversubscription, these will be the nodes with the highest load (as those are capped to 0 revocable resources). If we schedule further non-revocable tasks on there, we will either get bad performance or trigger the termination of revocable tasks (depending on our `Estimator` and `QoSController`).
> > 
> > Alternative first-fit solutions that come to mind (partially inspired by http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf):
> > 
> > * Sort by agent id. Over time, this should converge to the same cluster state as sorting offers by size. The upside is that it would work for revocable and non-revocable resources simultaneously.
> > * Compute a score per offer and then schedule on the node with the smallest offer. As an approximation we could use something like `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`.
> > * Extend the sort by revocable resources with a special case for 0 remaining revocable resources.
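To make sure I'm reading the score-based suggestion correctly, here is a rough sketch of how I'd interpret it (`OfferScoring`, `OfferStats` and its fields are placeholders for illustration, not real Aurora/Mesos accessors): compute a per-offer score and pick the matching offer with the smallest one, where a low score means the agent is already heavily loaded.

```java
// Sketch only: lower score == agent already heavily loaded, so picking the
// minimum-score offer packs work onto busy agents. OfferStats is a placeholder.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class OfferScoring {
  static final class OfferStats {
    final double freeRam;
    final double totalRam;
    final double freeRevocableRam;
    final double totalRevocableRam;

    OfferStats(double freeRam, double totalRam,
               double freeRevocableRam, double totalRevocableRam) {
      this.freeRam = freeRam;
      this.totalRam = totalRam;
      this.freeRevocableRam = freeRevocableRam;
      this.totalRevocableRam = totalRevocableRam;
    }
  }

  private static double ratio(double free, double total) {
    // Guard against agents that currently expose no revocable resources.
    return total == 0 ? 0 : free / total;
  }

  // 10 ** (free/total) per resource dimension, summed across dimensions.
  static double score(OfferStats offer) {
    return Math.pow(10, ratio(offer.freeRam, offer.totalRam))
        + Math.pow(10, ratio(offer.freeRevocableRam, offer.totalRevocableRam));
  }

  // Schedule on the offer with the smallest score among those that match the task.
  static Optional<OfferStats> pick(List<OfferStats> matchingOffers) {
    return matchingOffers.stream()
        .min(Comparator.comparingDouble(OfferScoring::score));
  }
}
```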
I guess my main question here is: why couldn't your alternative orderings just be submitted as patches? Sorting offers is definitely not a panacea and I didn't intend to suggest that. Please see my reply to Santhosh too. I will clean up the README/docs if we decide to ship this.

To clarify further: what this patch is trying to enable is the ability to influence the global ordering of offers in a way that is better than random. So it's basically a chance to say "this is what I would prefer to happen for _all_ tasks in the cluster in the absence of any other information". E.g., at Twitter CPU is by far our scarcest resource, and any bin-packing solution would be based almost solely on that.

But what you really want is a score-based approach that makes smart decisions _per-task_. This is sort of what constraints are, except for bin-packing we don't want to avoid scheduling things when ideal circumstances cannot be found, but instead make sure we pick the best of a bad bunch.

The problem is performance. Our steady state at Twitter is about 300 calls to TaskAssigner.maybeAssign per minute - this is your crons, failing tasks and other recurring scheduling load. When engineers start deploying, we see that number increase to around 1-2k per minute. The current algorithm has a worst-case complexity of O(o·n), where o = number of offers and n = number of assign attempts (let's assume the filtering done per offer is a constant cost). The worst case is hit when none of the offers match a task, which isn't all that uncommon. With 2k attempts and 40k offers, this is already getting close to being a scheduling bottleneck for us, e.g. http://i.imgur.com/C9O8BN8.png.

A score-based approach that ranks all offers based on what is ideal for the current task is much more complex. You need to build a sorted structure of all offers for each maybeAssign attempt, so you now have something like an O(o·log(o)·n) algorithm. You can optimize n by increasing the batch size, etc., but there's an upper bound on this (and a bunch of other negative trade-offs). So the only other approach is to limit o, either by maintaining smaller clusters (which is recommended in the Borg paper) or by sampling, where you score a fixed-size sample of offers rather than the entire set.

We're still considering both approaches to reducing the number of offers. But with the latter approach of scoring only a subset of matching offers, what we've run into is that with a random ordering of all offers (e.g. OfferManager.getOffers on master), the improvement in bin-packing vs. first-fit and random is completely negligible unless you blow up performance by increasing the sample size. One of the big problems is that as soon as a task is scheduled, the previous filtering decisions on all offers are potentially invalidated by things like host and limit constraints. But when I added a CPU sort on the offers, the results were dramatically improved - only a couple of percentage points away from a perfect bin-packing simulation. So that's the main motivation for this patch.

Unfortunately we're stuck in the short term on this single-writer, single-reader architecture for Aurora, so we can't even fan out this workload by increasing the number of replicas, etc.
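To give a concrete picture of what the sort buys the first-fit path, here is a rough sketch of the idea (simplified; `CpuOrderedOffers` and `HostOffer` are stand-ins and not the actual OfferManager/OfferOrder code in the diff): keep offers in a CPU-sorted structure so that a plain first-fit scan naturally lands on the smallest offer that still fits the task.

```java
// Sketch only: not the actual OfferManager/OfferOrder implementation.
// Offers are kept sorted by available CPU so a plain first-fit scan picks
// the smallest offer that can still hold the task.
import java.util.Comparator;
import java.util.Optional;
import java.util.concurrent.ConcurrentSkipListSet;

final class CpuOrderedOffers {
  static final class HostOffer {
    final String offerId;
    final double availableCpus;

    HostOffer(String offerId, double availableCpus) {
      this.offerId = offerId;
      this.availableCpus = availableCpus;
    }
  }

  // Order by available CPU, tie-breaking on offer id so distinct offers with
  // equal CPU are not collapsed by the set.
  private static final Comparator<HostOffer> CPU_ORDER =
      Comparator.comparingDouble((HostOffer o) -> o.availableCpus)
          .thenComparing(o -> o.offerId);

  private final ConcurrentSkipListSet<HostOffer> offers =
      new ConcurrentSkipListSet<>(CPU_ORDER);

  void addOffer(HostOffer offer) {
    offers.add(offer);
  }

  // First-fit over the sorted view: the first offer that satisfies the CPU
  // request is also (approximately) the smallest one that can hold it. Real
  // assignment would still apply constraint/veto filtering per offer.
  Optional<HostOffer> firstFit(double requiredCpus) {
    for (HostOffer offer : offers) {
      if (offer.availableCpus >= requiredCpus) {
        return Optional.of(offer);
      }
    }
    return Optional.empty();
  }
}
```

The nice property is that the assignment loop itself stays first-fit; only the iteration order of the offers changes.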
- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------


On May 23, 2017, 7:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 7:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing first-fit task assigner, and it can be controlled with a simple command line argument.
> 
> The bin-packing is only an approximation, but it can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs), we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic computing and paying for machines by the minute/hour, utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%.
> 
> The approximation is based on the simple idea that you have the best chance of achieving perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8-core and a 12-core offer available, you'd always want to put the task in the 8-core offer*. By sorting offers in OfferManager during iteration, a first-fit algorithm is guaranteed to match the smallest possible offer for your task, which achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources available, and more satisfactory results may also need preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java f2296a9d7a88be7e43124370edecfe64415df00f
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute.
> 
> There is an overhead when scheduling tasks that have large resource requirements, as the task assigner will first need to skip over all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by just jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
> 
> Thanks,
> 
> David McLaughlin
> 
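P.S. On the possible follow-up mentioned in the Testing section above (jumping over all the offers we know to be too small): if the offers are held in a structure sorted on the same dimension as the OfferOrder, that jump can be a single lookup into the sorted set. The sketch below is purely illustrative and reuses the simplified `HostOffer` stand-in from the earlier sketch; it is not code from the actual diff.

```java
// Sketch only: skip directly past offers that are too small for the task by
// jumping into a CPU-sorted set, instead of iterating over them one by one.
import java.util.Comparator;
import java.util.NavigableSet;
import java.util.Optional;
import java.util.TreeSet;
import java.util.function.Predicate;

final class SkipAheadFirstFit {
  static final class HostOffer {
    final String offerId;
    final double availableCpus;

    HostOffer(String offerId, double availableCpus) {
      this.offerId = offerId;
      this.availableCpus = availableCpus;
    }
  }

  private static final Comparator<HostOffer> CPU_ORDER =
      Comparator.comparingDouble((HostOffer o) -> o.availableCpus)
          .thenComparing(o -> o.offerId);

  private final NavigableSet<HostOffer> offers = new TreeSet<>(CPU_ORDER);

  void addOffer(HostOffer offer) {
    offers.add(offer);
  }

  // The probe sorts before every real offer with >= requiredCpus, so tailSet()
  // starts iteration at the first offer big enough for the task. Offers after
  // the jump still go through the usual constraint/veto filtering.
  Optional<HostOffer> firstFit(double requiredCpus, Predicate<HostOffer> passesFiltering) {
    HostOffer probe = new HostOffer("", requiredCpus);
    for (HostOffer offer : offers.tailSet(probe, true)) {
      if (passesFiltering.test(offer)) {
        return Optional.of(offer);
      }
    }
    return Optional.empty();
  }
}
```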
