[ https://issues.apache.org/jira/browse/MESOS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418115#comment-16418115 ]
Kapil Arya commented on MESOS-8171:
-----------------------------------
Backported the fix to 1.4.x and 1.3.x branches.
> Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop
> -----------------------------------------------------------------------------------------------------
>
> Key: MESOS-8171
> URL: https://issues.apache.org/jira/browse/MESOS-8171
> Project: Mesos
> Issue Type: Bug
> Components: c++ api, java api, scheduler driver
> Affects Versions: 1.1.3, 1.2.2, 1.3.1, 1.4.0
> Reporter: Tim Harper
> Assignee: Meng Zhu
> Priority: Minor
> Labels: mesosphere
> Fix For: 1.4.2, 1.5.0, 1.3.3
>
>
> Over the past year, the Marathon team has been plagued by an issue that
> periodically hits our CI builds, in which the scheduler driver enters a
> tight loop and sends tens of thousands of SUBSCRIBE calls to the master
> per second. I turned on debug logging for both the client and the master,
> and it pointed to an issue with the {{doReliableRegistration}} method in
> sched.cpp. Here are the logs:
> {code}
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on 127.0.1.1:60957 with 8 worker threads
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151785 13791 group.cpp:341] Group process (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' at '/mesos' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0')
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.152762 13791 group.cpp:700] Trying to get '/mesos/json.info_0000000000' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master (master@127.0.1.1:32856) is detected
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157347 13787 sched.cpp:336] New master detected at master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to register without authentication
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159785 13789 sched.cpp:869] Will retry registration in 0ns if necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159878 13792 sched.cpp:836] Sending SUBSCRIBE call to master@127.0.1.1:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 05:39:39.159916 13792 sched.cpp:869] Will retry registration in 0ns if necessary
> {code}
> In Marathon, when running our tests, we set the failoverTimeout to 0 so
> that the Mesos master immediately forgets about a framework when it
> disconnects.
> On line 860 of sched.cpp, the maximum backoff is bounded by 1/10th of the
> failoverTimeout, which explains why the retry delay above is 0:
> {code}
> ./mesos/src/sched/sched.cpp
> 818 |   void doReliableRegistration(Duration maxBackoff)
> 819 |   {
> ...
> 851 |     // Bound the maximum backoff by 'REGISTRATION_RETRY_INTERVAL_MAX'.
> 852 |     maxBackoff =
> 853 |       std::min(maxBackoff, scheduler::REGISTRATION_RETRY_INTERVAL_MAX);
> 854 |
> 855 |     // If failover timeout is present, bound the maximum backoff
> 856 |     // by 1/10th of the failover timeout.
> 857 |     if (framework.has_failover_timeout()) {
> 858 |       Try<Duration> duration = Duration::create(framework.failover_timeout());
> 859 |       if (duration.isSome()) {
> 860 |         maxBackoff = std::min(maxBackoff, duration.get() / 10);
> 861 |       }
> 862 |     }
> 863 |
> 864 |     // Determine the delay for next attempt by picking a random
> 865 |     // duration between 0 and 'maxBackoff'.
> 866 |     // TODO(vinod): Use random numbers from <random> header.
> 867 |     Duration delay = maxBackoff * ((double) os::random() / RAND_MAX);
> 868 |
> 869 |     VLOG(1) << "Will retry registration in " << delay << " if necessary";
> 870 |
> 871 |     // Backoff.
> 872 |     frameworkRegistrationTimer = process::delay(
> 873 |         delay, self(), &Self::doReliableRegistration, maxBackoff * 2);
> 874 |   }
> 875 |
> {code}
> Reading through the code, once this value reaches 0 it stays 0: the
> backoff grows multiplicatively (0 * 2 == 0), and the failover_timeout / 10
> bound is re-applied on every attempt.
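> Here's a minimal standalone sketch of that zero-lock (hypothetical names,
> not Mesos code): the failover-timeout cap is re-applied on every attempt,
> so once the backoff hits 0 it can never grow again:
> {code}
> #include <algorithm>
> #include <iostream>
>
> // Simplified model of the backoff logic in doReliableRegistration().
> // 'failoverTimeout' and 'kRetryIntervalMax' stand in for
> // framework.failover_timeout() and REGISTRATION_RETRY_INTERVAL_MAX.
> int main() {
>   const double kRetryIntervalMax = 60.0;  // seconds
>   const double failoverTimeout = 0.0;     // the problematic default
>
>   double maxBackoff = 2.0;                // initial backoff, seconds
>   for (int attempt = 1; attempt <= 5; ++attempt) {
>     maxBackoff = std::min(maxBackoff, kRetryIntervalMax);
>     // Re-applied on every attempt: with failoverTimeout == 0 this pins
>     // maxBackoff (and hence the random retry delay) at 0 forever.
>     maxBackoff = std::min(maxBackoff, failoverTimeout / 10);
>     std::cout << "attempt " << attempt
>               << ": maxBackoff = " << maxBackoff << "s\n";
>     maxBackoff *= 2;  // 0 * 2 == 0
>   }
>   return 0;
> }
> {code}
> Every attempt prints {{maxBackoff = 0s}}, matching the "Will retry
> registration in 0ns" lines above.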
> To make matters worse, a failoverTimeout of {{0}} is the default:
> {code}
> ./mesos/include/mesos/mesos.proto
> 238 | // The amount of time (in seconds) that the master will wait for the
> 239 | // scheduler to failover before it tears down the framework by
> 240 | // killing all its tasks/executors. This should be non-zero if a
> 241 | // framework expects to reconnect after a failure and not lose its
> 242 | // tasks/executors.
> 243 | //
> 244 | // NOTE: To avoid accidental destruction of tasks, production
> 245 | // frameworks typically set this to a large value (e.g., 1 week).
> 246 | optional double failover_timeout = 4 [default = 0.0];
> 247 |
> {code}
> I've confirmed that with version 1.4.0 of the Mesos Java client jar, this
> default is used when failoverTimeout is not set:
> {code}
> @ import $ivy.`org.apache.mesos:mesos:1.4.0`
> import $ivy.$
> @ import org.apache.mesos.Protos
> import org.apache.mesos.Protos
> @ Protos.FrameworkInfo.newBuilder.setName("test").setUser("user").build.getFailoverTimeout
> res3: Double = 0.0
> {code}
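> For what it's worth, a guard along these lines in
> {{doReliableRegistration}} would break the loop by ignoring a non-positive
> failover timeout when bounding the backoff. This is only a sketch of one
> possible fix, not necessarily the patch that actually landed:
> {code}
> // Hypothetical guard: only apply the failover-timeout bound when it
> // yields a positive backoff, so a timeout of 0 cannot pin the retry
> // delay at 0ns.
> if (framework.has_failover_timeout()) {
>   Try<Duration> duration = Duration::create(framework.failover_timeout());
>   if (duration.isSome() && duration.get() > Seconds(0)) {
>     maxBackoff = std::min(maxBackoff, duration.get() / 10);
>   }
> }
> {code}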
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)