Dario Rexin created MESOS-4694:
----------------------------------
Summary: DRFAllocator takes very long to allocate resources with a
large number of frameworks
Key: MESOS-4694
URL: https://issues.apache.org/jira/browse/MESOS-4694
Project: Mesos
Issue Type: Bug
Components: allocation
Affects Versions: 0.26.0, 0.27.0, 0.27.1
Reporter: Dario Rexin
Assignee: Dario Rexin
As the number of connected frameworks grows, allocation time increases
sharply. The addition of quota in 0.27 made this worse. Running
`mesos-tests.sh --benchmark
--gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us the
following numbers:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 200 frameworks
round 0 allocate took 2.921202secs to make 200 offers
round 1 allocate took 2.85045secs to make 200 offers
round 2 allocate took 2.823768secs to make 200 offers
{noformat}
Increasing the number of frameworks to 2000:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 28.209454secs to make 2000 offers
round 1 allocate took 28.469419secs to make 2000 offers
round 2 allocate took 28.138086secs to make 2000 offers
{noformat}
I was able to cut this time by more than half. After applying the
patches:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 200 frameworks
round 0 allocate took 1.016226secs to make 2000 offers
round 1 allocate took 1.102729secs to make 2000 offers
round 2 allocate took 1.102624secs to make 2000 offers
{noformat}
And with 2000 frameworks:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 12.563203secs to make 2000 offers
round 1 allocate took 12.437517secs to make 2000 offers
round 2 allocate took 12.470708secs to make 2000 offers
{noformat}
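As a rough check on the numbers above, the speedup can be computed directly from the quoted round times. The Python sketch below is only illustrative arithmetic over the benchmark output in this report. The names `before`, `after` and `speedup` are made up for the example and are not part of Mesos. It comes out at roughly 2.7x for 200 frameworks and 2.3x for 2000 frameworks.

```python
from statistics import mean

# Round times in seconds, copied from the benchmark output quoted above.
# Keys are the number of frameworks used in the run.
before = {
    200: [2.921202, 2.85045, 2.823768],
    2000: [28.209454, 28.469419, 28.138086],
}
after = {
    200: [1.016226, 1.102729, 1.102624],
    2000: [12.563203, 12.437517, 12.470708],
}


def speedup(frameworks):
    """Ratio of the mean allocation time before and after the patches."""
    return mean(before[frameworks]) / mean(after[frameworks])


for n in (200, 2000):
    print(f"{n} frameworks: {mean(before[n]):.2f}s -> {mean(after[n]):.2f}s "
          f"({speedup(n):.2f}x faster)")
```

The unpatched runs also show the scaling problem: ten times as many frameworks takes about ten times as long per allocation round.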
The patches do three things to improve the performance of the allocator:
1) The total values in the DRFSorter are precalculated per resource type.
2) In the allocate method, when no resources are available to allocate, we
break out of the innermost loop, which avoids looping over a large number of
frameworks when there is nothing to allocate.
3) When a framework suppresses offers, we remove it from the sorter instead of
just calling continue in the allocation loop. This greatly improves
performance in the sorter and avoids looping over frameworks that don't need
resources.
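The three changes can be illustrated with a toy model. This is a minimal Python sketch, not the Mesos C++ code: `ToySorter`, `allocate` and all method names are hypothetical, and it assumes a simplified allocator that offers all of an agent's free resources to the active framework with the lowest dominant share.

```python
class ToySorter:
    """Toy DRF sorter (hypothetical, not the Mesos DRFSorter API)."""

    def __init__(self):
        self.totals = {}       # resource name -> cluster-wide total (change 1)
        self.allocations = {}  # client -> {resource name: amount held}
        self.active = set()    # clients that currently want offers (change 3)

    def add_total(self, resources):
        # Maintain per-resource totals incrementally, so share() does not
        # have to re-sum every agent's resources on each call.
        for name, amount in resources.items():
            self.totals[name] = self.totals.get(name, 0.0) + amount

    def add(self, client):
        self.allocations.setdefault(client, {})
        self.active.add(client)

    def deactivate(self, client):
        # A framework that suppressed offers leaves the sort set entirely,
        # but keeps its recorded allocation.
        self.active.discard(client)

    def allocated(self, client, resources):
        held = self.allocations[client]
        for name, amount in resources.items():
            held[name] = held.get(name, 0.0) + amount

    def share(self, client):
        # Dominant share: the largest fraction of any one resource held.
        held = self.allocations[client]
        return max((held.get(name, 0.0) / total
                    for name, total in self.totals.items() if total > 0),
                   default=0.0)

    def sort(self):
        # Only active clients are sorted; ties are broken by name.
        return sorted(self.active, key=lambda c: (self.share(c), c))


def allocate(sorter, agents):
    """Offer each agent's free resources to the lowest-share active client."""
    offers = []
    for agent, free in agents.items():
        for client in sorter.sort():
            if not any(amount > 0 for amount in free.values()):
                break  # change 2: nothing left on this agent, stop scanning
            offers.append((agent, client, dict(free)))
            sorter.allocated(client, free)
            for name in free:
                free[name] = 0.0
    return offers


# Usage: four identical agents, three frameworks, one of them suppressed.
sorter = ToySorter()
agents = {f"agent{i}": {"cpus": 4.0, "mem": 1024.0} for i in range(4)}
for free in agents.values():
    sorter.add_total(free)
for name in ("fw-a", "fw-b", "fw-c"):
    sorter.add(name)
sorter.deactivate("fw-c")  # fw-c has nothing to schedule
offers = allocate(sorter, agents)
print([client for _, client, _ in offers])
```

In this model the suppressed framework never enters the sort, so the cost of each round depends on the number of active frameworks rather than the number of connected ones.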
Assuming that most frameworks behave nicely and suppress offers when
they have nothing to schedule, it is fair to assume that point 3) has the
biggest impact on performance. If we suppress offers for 90% of the
frameworks in the benchmark test, we see the following numbers:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 200 slaves and 2000 frameworks
round 0 allocate took 11626us to make 200 offers
round 1 allocate took 22890us to make 200 offers
round 2 allocate took 21346us to make 200 offers
{noformat}
And for 200 frameworks:
{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from HierarchicalAllocator_BENCHMARK_Test
[ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers
Using 2000 slaves and 2000 frameworks
round 0 allocate took 1.11178secs to make 2000 offers
round 1 allocate took 1.062649secs to make 2000 offers
round 2 allocate took 1.080181secs to make 2000 offers
{noformat}
Review requests:
https://reviews.apache.org/r/43665/
https://reviews.apache.org/r/43666/
https://reviews.apache.org/r/43668/