-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------



I have thought about this in the past as well, and I am slightly skeptical that a 
simple sort by resources is a good approach when using oversubscription, 
especially given the goal of running almost all (non-production) tasks as revocable.

Offers do contain both revocable and non-revocable resources, so we would need 
to find an ordering that is suitable for both (see the comparator sketch after 
this list):

* Sort by non-revocable resources first: We pack the production tasks tightly, 
but those are now pretty simple to schedule anyway, as most workloads run as 
revocable. The corresponding downside is that we don't reduce fragmentation 
for revocable resources.

* Sort by revocable resources: offers with the fewest revocable resources will 
be at the front. Given the nature of oversubscription, these will be the nodes 
with the highest load (as those are capped to 0 revocable resources). If we 
schedule further non-revocable tasks there, we will either get bad performance 
or trigger the termination of revocable tasks (depending on our `Estimator` and 
`QoSController`).
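
To make those two orderings concrete, here is a minimal comparator sketch. The 
`HostOffer` shape and the `nonRevocableCpus`/`revocableCpus` fields are 
illustrative assumptions on my part, not the actual `OfferManager`/`OfferOrder` 
types from the patch:

```java
import java.util.Comparator;

final class OfferOrderSketch {
  // Illustrative offer view exposing only the quantities we sort on.
  static final class HostOffer {
    final double nonRevocableCpus;
    final double revocableCpus;

    HostOffer(double nonRevocableCpus, double revocableCpus) {
      this.nonRevocableCpus = nonRevocableCpus;
      this.revocableCpus = revocableCpus;
    }
  }

  // Option 1: packs production (non-revocable) tasks tightly, but does
  // nothing about fragmentation of revocable resources.
  static final Comparator<HostOffer> BY_NON_REVOCABLE =
      Comparator.comparingDouble((HostOffer o) -> o.nonRevocableCpus);

  // Option 2: puts offers with the fewest revocable resources first, i.e.
  // the most loaded hosts under oversubscription, risking bad performance
  // or revocable task terminations when more non-revocable work lands there.
  static final Comparator<HostOffer> BY_REVOCABLE =
      Comparator.comparingDouble((HostOffer o) -> o.revocableCpus);
}
```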


Alternative first-fit solutions that come to mind (partially inspired by 
http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf):

* Sort by agent id. Over time, this should converge to the same cluster state 
as sorting offers by size. The upside is 
  that it would work for revocable and non-revocable resources simultaneously.

* Compute a score per offer and then schedule on the offer with the smallest 
  score (see the sketch after this list). As an approximation we could use 
  something like 
  `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`.

* Extend the sort by revocable resources with a special case for 0 remaining 
revocable resources.
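
A rough sketch of that scoring idea, purely as an illustration; the field 
names, the base of 10, and the zero-capacity guard are my assumptions, not part 
of the patch:

```java
import java.util.Comparator;

final class OfferScoreSketch {
  // Illustrative offer view; not the actual HostOffer used by OfferManager.
  static final class HostOffer {
    double freeRam;
    double totalRam;
    double freeRevocableRam;
    double totalRevocableRam;
  }

  // Exponential score: a dimension that is nearly full contributes close to
  // 10**0 = 1, a nearly empty one close to 10**1 = 10, so iterating offers
  // in ascending score order packs the fullest hosts first while treating
  // revocable and non-revocable headroom symmetrically.
  static double score(HostOffer o) {
    double free = o.totalRam > 0 ? o.freeRam / o.totalRam : 0;
    // Guard against hosts that currently expose no revocable capacity.
    double freeRevocable =
        o.totalRevocableRam > 0 ? o.freeRevocableRam / o.totalRevocableRam : 0;
    return Math.pow(10, free) + Math.pow(10, freeRevocable);
  }

  // First-fit would then iterate offers in ascending score order.
  static final Comparator<HostOffer> BY_SCORE =
      Comparator.comparingDouble(OfferScoreSketch::score);
}
```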

- Stephan Erb


On May 23, 2017, 9:41 a.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 23, 2017, 9:41 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the 
> existing first-fit task assigner, and it can be controlled with a simple 
> command line argument. 
> 
> The bin-packing is only an approximation, but can lead to pretty significant 
> improvements in resource utilization per agent. For example, on a CPU-bound 
> cluster with 30k+ hosts and 135k tasks (across 1k+ jobs) - we were able to 
> reduce the number of hosts with tasks scheduled on them to just 90%, down 
> from 99.7% (as one would expect from randomization). So if you are running 
> Aurora on elastic computing and paying for machines by the minute/hour, then 
> utilizing this patch _could_ allow you to reduce your server footprint by as 
> much as 10%. 
> 
> The approximation is based on the simple idea that you have the best chance 
> of having perfect bin-packing if you put tasks in the smallest slot 
> available. So if you have a task needing 8 cores and you have an 8 core and 
> 12 core offer available - you'd always want to put the task in the 8 core 
> offer*. By sorting offers in OfferManager during iteration, a first-fit 
> algorithm is guaranteed to match the smallest possible offer for your task, 
> achieving this.
> 
> * - The correct decision of course depends on the other pending tasks and the 
> other resources available, and more satisfactory results may also need 
> preemption, etc.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 77376e438bd7af74c364dcd5d1b3e3f1ece2adbf 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java 
> adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 
> 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 
> d7addc0effb60c196cf339081ad81de541d05385 
>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 
> 676d305d257585e53f0a05b359ba7eb11f1b23be 
> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/1/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, 
> adding only a few extra seconds total in TaskAssigner when applied to 
> thousands of tasks per minute. 
> 
> There is an overhead when scheduling tasks that have large resource 
> requirements, as the task assigner will first need to skip all the 
> offers with low resources. In a packed cluster, this is where the extra 
> seconds are spent. This could be reduced by just jumping over all the offers 
> we know to be too small, but that decision has to map to the OfferOrder 
> (which adds complexity). That can be addressed in a follow-up review if 
> needed.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>
