[ https://issues.apache.org/jira/browse/MESOS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Harper updated MESOS-8171:
------------------------------
    Description: 
Over the past year, the Marathon team has been plagued by an issue that
periodically hits our CI builds: the scheduler driver enters a tight loop,
sending tens of thousands of SUBSCRIBE calls to the master per second. I turned
on debug logging for the client and the server, and it pointed to an issue with
the {{doReliableRegistration}} method in sched.cpp. Here are the logs:

{code}
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on 127.0.1.1:60957 with 8 worker threads
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151785 13791 group.cpp:341] Group process (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0')
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152762 13791 group.cpp:700] Trying to get '/mesos/json.info_0000000000' in ZooKeeper
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master (UPID=master@172.16.10.95:32856) is detected
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157347 13787 sched.cpp:336] New master detected at master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to register without authentication
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159785 13789 sched.cpp:869] Will retry registration in 0ns if necessary
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159878 13792 sched.cpp:836] Sending SUBSCRIBE call to master@172.16.10.95:32856
WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159916 13792 sched.cpp:869] Will retry registration in 0ns if necessary
{code}

In Marathon, when running our tests, we set the failoverTimeout to 0 so that
the Mesos master immediately forgets about a framework when it disconnects.
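
For context, here is a minimal sketch (not Marathon's actual test code) of how
a framework ends up with a zero failover timeout via the Mesos C++ API; the
framework name below is hypothetical, and {{FrameworkInfo.failover_timeout}} is
a double in seconds:

{code}
#include <mesos/mesos.hpp>

// Sketch only: build a FrameworkInfo the way a test harness might.
mesos::FrameworkInfo makeTestFrameworkInfo()
{
  mesos::FrameworkInfo framework;
  framework.set_user("");                  // empty user => Mesos uses the current user
  framework.set_name("marathon-ci-test");  // hypothetical framework name
  framework.set_failover_timeout(0.0);     // 0 => master forgets the framework on disconnect
  return framework;
}
{code}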

On line 860 of sched.cpp, the maximum backoff is capped at 1/10th of the
failoverTimeout, which explains why the retry delay ends up being 0:

{code}
/Users/tim/src/m8e/mesos/src/sched/sched.cpp

 818 |   void doReliableRegistration(Duration maxBackoff)
 819 |   {
...
 851 |     // Bound the maximum backoff by 'REGISTRATION_RETRY_INTERVAL_MAX'.
 852 |     maxBackoff =
 853 |       std::min(maxBackoff, scheduler::REGISTRATION_RETRY_INTERVAL_MAX);
 854 | 
 855 |     // If failover timeout is present, bound the maximum backoff
 856 |     // by 1/10th of the failover timeout.
 857 |     if (framework.has_failover_timeout()) {
 858 |       Try<Duration> duration = Duration::create(framework.failover_timeout());
 859 |       if (duration.isSome()) {
 860 |         maxBackoff = std::min(maxBackoff, duration.get() / 10);
 861 |       }
 862 |     }
 863 | 
 864 |     // Determine the delay for next attempt by picking a random
 865 |     // duration between 0 and 'maxBackoff'.
 866 |     // TODO(vinod): Use random numbers from <random> header.
 867 |     Duration delay = maxBackoff * ((double) os::random() / RAND_MAX);
 868 | 
 869 |     VLOG(1) << "Will retry registration in " << delay << " if necessary";
 870 | 
 871 |     // Backoff.
 872 |     frameworkRegistrationTimer = process::delay(
 873 |         delay, self(), &Self::doReliableRegistration, maxBackoff * 2);
 874 |   }
 875 | 
{code}

Reading through the code, it seems that once maxBackoff is 0 it will always be
0, since the backoff grows multiplicatively (0 * 2 == 0), so the driver keeps
retrying registration with no delay.
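
To illustrate, here is a standalone sketch of that arithmetic (plain doubles
standing in for Mesos's Duration type and constants; this is not Mesos code),
showing how a failover timeout of 0 pins the delay at 0 on every attempt:

{code}
#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main()
{
  const double retryIntervalMax = 60.0;  // stand-in for REGISTRATION_RETRY_INTERVAL_MAX
  const double failoverTimeout = 0.0;    // what Marathon's tests pass
  double maxBackoff = 2.0;               // stand-in for the initial backoff

  for (int attempt = 0; attempt < 5; ++attempt) {
    maxBackoff = std::min(maxBackoff, retryIntervalMax);
    maxBackoff = std::min(maxBackoff, failoverTimeout / 10);  // collapses to 0 on the first attempt
    double delay = maxBackoff * ((double) std::rand() / RAND_MAX);
    std::printf("attempt %d: delay = %gs\n", attempt, delay);  // prints "delay = 0s" every time
    maxBackoff *= 2;  // 0 * 2 == 0, so the backoff never recovers
  }

  return 0;
}
{code}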


> Using a failoverTimeout of 0 with Mesos native scheduler client can result in 
> infinite subscribe loop
> -----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8171
>                 URL: https://issues.apache.org/jira/browse/MESOS-8171
>             Project: Mesos
>          Issue Type: Bug
>          Components: c++ api, java api, scheduler driver
>    Affects Versions: 1.4.0
>            Reporter: Tim Harper
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
