Ilya Pronin created MESOS-7376:
----------------------------------

             Summary: Long registry updates when the number of agents is high
                 Key: MESOS-7376
                 URL: https://issues.apache.org/jira/browse/MESOS-7376
             Project: Mesos
          Issue Type: Improvement
          Components: master
    Affects Versions: 1.3.0
            Reporter: Ilya Pronin
            Assignee: Ilya Pronin


During scale testing we discovered that the time it takes to update the 
registry grows to unacceptable values very quickly as the number of registered 
agents increases. At some point the update time starts exceeding 
{{registry_store_timeout}}, yet the timeout doesn't fire.

With 55k agents we saw this ({{registry_store_timeout=20secs}}):
{noformat}
I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 3.138843387secs; attempting to update the registry
I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 74461ns
I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at position=6420881 in 2.41043644secs
I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has finished in 2.428189561secs (b=1)
I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the registry in 34.971944192secs
{noformat}

This is caused by repeated copying of the {{Registry}}. Each copy duplicates 
a big object graph, which takes roughly 0.4 sec with 55k agents.
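
To illustrate the copying cost described above, here is a minimal sketch that contrasts copying a large object on every update with handing it over by move. Everything in it is hypothetical and is not Mesos code: {{AgentEntry}}, the simplified {{Registry}} struct, {{g_copies}}, {{makeRegistry}}, {{applyWithCopy}} and {{applyInPlace}} are stand-ins invented for the example, and the real {{Registry}} is considerably more complex.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; not the Mesos types.
struct AgentEntry {
  std::string id;
  std::string hostname;
};

// Counts deep copies of Registry so the difference is observable.
std::size_t g_copies = 0;

struct Registry {
  std::vector<AgentEntry> agents;

  Registry() = default;
  Registry(const Registry& other) : agents(other.agents) { ++g_copies; }
  Registry(Registry&&) = default;
  Registry& operator=(const Registry& other) {
    agents = other.agents;
    ++g_copies;
    return *this;
  }
  Registry& operator=(Registry&&) = default;
};

// Builds a registry with n placeholder agents.
Registry makeRegistry(std::size_t n) {
  Registry r;
  r.agents.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    r.agents.push_back({"agent-" + std::to_string(i),
                        "host-" + std::to_string(i)});
  }
  return r;
}

// Copy-per-update: duplicates every entry before changing one of them,
// so k updates on n agents cost O(k * n).
Registry applyWithCopy(const Registry& current, const AgentEntry& added) {
  Registry next = current;  // deep copy of the whole object graph
  next.agents.push_back(added);
  return next;
}

// Move-based update: the caller gives up ownership, so no entries are
// duplicated and the cost does not depend on the number of agents
// (apart from occasional vector growth).
Registry applyInPlace(Registry current, AgentEntry added) {
  current.agents.push_back(std::move(added));
  return current;
}
```

A caller would write {{registry = applyInPlace(std::move(registry), entry);}} to avoid the copy. Whether the registrar can give up the old value this way depends on how it recovers from a failed store, so this is only a sketch of the trade-off, not a proposed patch.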



