[
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964547#comment-15964547
]
Ilya Pronin edited comment on MESOS-7376 at 4/11/17 3:49 PM:
-------------------------------------------------------------
Review request: https://reviews.apache.org/r/58355/
{{Registry}} is being copied as a part of
{{state::protobuf::State::Variable<Registry>}} or as a bound parameter in
{{state::protobuf::State::store()}}. The former can be mitigated by adding move
support to {{Variable}} (using {{Swap()}} for protobuf message). The latter -
by using {{Owned}}. But that's not enough because {{Variable}} will still be
copied in return value propagation through the {{Future}}-s chain. So in my
patch I bypassed {{state::protobuf::State}}.
[~bmahler] can you shepherd this please?
h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the
registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the
registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the
registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the
registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the
registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the
registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 50000 agents in
30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the
registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the
registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the
registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 50000 agents
reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the
registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the
registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the
registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the
registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the
registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the
registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 50000 agents in
6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the
registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the
registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the
registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 50000 agents
reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the
registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 50000 agents
(8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the
registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the
registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the
registry in 264.850688ms
Removed 50000 agents in 2.12256788111667mins
{noformat}
Similar picture in scale testing:
{noformat}
I0411 13:04:27.598438 41549 registrar.cpp:537] Successfully updated the
registry in 2.68846208secs
I0411 13:04:30.716615 41552 registrar.cpp:537] Successfully updated the
registry in 2.61457792secs
I0411 13:04:33.828133 41554 registrar.cpp:537] Successfully updated the
registry in 2.644827904secs
I0411 13:04:37.634577 41553 registrar.cpp:537] Successfully updated the
registry in 3.338414848secs
I0411 13:04:40.723475 41546 registrar.cpp:537] Successfully updated the
registry in 2.629734144secs
{noformat}
was (Author: ipronin):
Review request: https://reviews.apache.org/r/58355/
{{Registry}} is being copied as a part of
{{state::protobuf::State::Variable<Registry>}} or as a bound parameter in
{{state::protobuf::State::store()}}. The former can be mitigated by adding move
support to {{Variable}} (using {{Swap()}} for protobuf message). The latter -
by using {{Owned}}. But that's not enough because {{Variable}} will still be
copied in return value propagation through the {{Future}}-s chain. So in my
patch I bypassed {{state::protobuf::State}}.
[~bmahler] can you shepherd this please?
h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the
registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the
registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the
registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the
registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the
registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the
registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 50000 agents in
30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the
registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the
registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the
registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 50000 agents
reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the
registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the
registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the
registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the
registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the
registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the
registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 50000 agents in
6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the
registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the
registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the
registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 50000 agents
reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the
registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 50000 agents
(8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the
registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the
registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the
registry in 264.850688ms
Removed 50000 agents in 2.12256788111667mins
{noformat}
> Long registry updates when the number of agents is high
> -------------------------------------------------------
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
> Issue Type: Improvement
> Components: master
> Affects Versions: 1.3.0
> Reporter: Ilya Pronin
> Assignee: Ilya Pronin
>
> During scale testing we discovered that as the number of registered agents
> grows the time it takes to update the registry grows to unacceptable values
> very fast. At some point it starts exceeding {{registry_store_timeout}} which
> doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big
> object graph that takes roughly 0.4 sec (with 55k agents).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)