[ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964547#comment-15964547
 ] 

Ilya Pronin commented on MESOS-7376:
------------------------------------

Review request: https://reviews.apache.org/r/58355/

{{Registry}} is being copied as a part of 
{{state::protobuf::State::Variable<Registry>}} or as a bound parameter in 
{{state::protobuf::State::store()}}. The former can be mitigated by adding move 
support to {{Variable}} (using {{Swap()}} for protobuf message). The latter - 
by using {{Owned}}. But that's not enough because {{Variable}} will still be 
copied in return value propagation through the {{Future}}-s chain. So in my 
patch I bypassed {{state::protobuf::State}}.

[~bmahler] can you shepherd this please?

h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the 
registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the 
registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the 
registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the 
registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the 
registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the 
registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 50000 agents in 
30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the 
registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the 
registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the 
registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 50000 agents 
reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the 
registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the 
registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the 
registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the 
registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the 
registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the 
registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 50000 agents in 
6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the 
registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the 
registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the 
registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 50000 agents 
reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the 
registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 50000 agents 
(8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the 
registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the 
registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the 
registry in 264.850688ms
Removed 50000 agents in 2.12256788111667mins
{noformat}

> Long registry updates when the number of agents is high
> -------------------------------------------------------
>
>                 Key: MESOS-7376
>                 URL: https://issues.apache.org/jira/browse/MESOS-7376
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.3.0
>            Reporter: Ilya Pronin
>            Assignee: Ilya Pronin
>
> During scale testing we discovered that as the number of registered agents 
> grows the time it takes to update the registry grows to unacceptable values 
> very fast. At some point it starts exceeding {{registry_store_timeout}} which 
> doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big 
> object graph that takes roughly 0.4 sec (with 55k agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to