[ https://issues.apache.org/jira/browse/MESOS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone reopened MESOS-2891:
-------------------------------

Reopening because updateSlave (and likely updateAllocation) also needs to be 
addressed.

Some numbers from a benchmark test:

{code}
[ RUN      ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0
Added 1000 slaves in 766.99568ms
Updated 1000 slaves in 6.807111421secs
[       OK ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/0 
(7751 ms)
[ RUN      ] SlaveCount/HierarchicalAllocator_BENCHMARK_Test.UpdateSlave/1
Added 5000 slaves in 3.886493374secs
Updated 5000 slaves in 4.07753897601667mins
[       OK ]
{code}
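The scaling between the two runs is consistent with per-call cost that grows
linearly with cluster size: 5x the slaves costs roughly 36x the total update
time, i.e. the update pass is roughly quadratic overall. A minimal toy model
(not the actual allocator API; names and structure here are illustrative)
showing how O(cluster-size) work inside each updateSlave call makes the whole
pass quadratic:

{code}
// Toy model, NOT the Mesos allocator: illustrates how per-call work that
// rescans all known slaves makes a full update pass quadratic in cluster size.
#include <cstddef>
#include <iostream>
#include <vector>

struct ToyAllocator
{
  std::vector<size_t> slaves;  // Stand-in for the allocator's slave state.
  size_t work = 0;             // Deterministic "work units" instead of wall time.

  void addSlave(size_t id) { slaves.push_back(id); }

  // Suspected shape of the regression: each update rescans every known
  // slave, so a single call costs O(|slaves|).
  void updateSlave(size_t) { work += slaves.size(); }
};

int main()
{
  const size_t counts[] = {1000, 5000};
  for (size_t n : counts) {
    ToyAllocator allocator;
    for (size_t i = 0; i < n; ++i) allocator.addSlave(i);
    for (size_t i = 0; i < n; ++i) allocator.updateSlave(i);
    std::cout << "slaves=" << n << " work=" << allocator.work << std::endl;
  }
}
{code}

In this model, 5x the slaves costs 25x the work (1,000,000 vs 25,000,000
units), which tracks the measured jump from ~6.8 seconds to ~4.1 minutes.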


> Performance regression in hierarchical allocator.
> -------------------------------------------------
>
>                 Key: MESOS-2891
>                 URL: https://issues.apache.org/jira/browse/MESOS-2891
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>            Reporter: Benjamin Mahler
>            Assignee: Jie Yu
>            Priority: Blocker
>              Labels: twitter
>             Fix For: 0.23.0
>
>         Attachments: Screen Shot 2015-06-18 at 5.02.26 PM.png, perf-kernel.svg
>
>
> For large clusters, the 0.23.0 allocator cannot keep up with the volume of 
> slaves. After the following slave was re-registered, it took the allocator a 
> long time to work through the backlog of slaves to add:
> {noformat:title=45 minute delay}
> I0618 18:55:40.738399 10172 master.cpp:3419] Re-registered slave 
> 20150422-211121-2148346890-5050-3253-S4695
> I0618 19:40:14.960636 10164 hierarchical.hpp:496] Added slave 
> 20150422-211121-2148346890-5050-3253-S4695
> {noformat}
> Empirically, 
> [addSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L462]
>  and 
> [updateSlave|https://github.com/apache/mesos/blob/dda49e688c7ece603ac7a04a977fc7085c713dd1/src/master/allocator/mesos/hierarchical.hpp#L533]
>  have become expensive.
> Some timings from a production cluster reveal that the allocator is spending 
> in the low tens of milliseconds on each call to {{addSlave}} and 
> {{updateSlave}}; with tens of thousands of slaves, this amounts to the large 
> delay seen above.
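> As a rough order-of-magnitude check (the figures below are illustrative, 
> not measured): at ~25ms per call, a backlog of ~100,000 queued 
> {{addSlave}}/{{updateSlave}} calls works out to
> {noformat}
> 100,000 calls * 25 ms/call = 2,500 s ~ 42 minutes
> {noformat}
> which is in line with the ~45 minute gap in the log above.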
> We also saw a slow steady increase in memory consumption, hinting further at 
> a queue backup in the allocator.
> A synthetic benchmark, like the one we did for the registrar, would be 
> prudent here, along with visibility into the allocator's queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
