Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Adam B Sun, 25 Jan 2015 02:57:07 -0800


> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > <https://reviews.apache.org/r/29507/diff/1/?file=804645#file804645line361>
> >
> >     Do you mind changing the names now that they're merely defaults?
> >     
> >     s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> >     s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> >     
> >     Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.
> 
> Adam B wrote:
>     Renamed to make the DEFAULT nature explicit.
>     
>     I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was. As I alluded to in the 
> MESOS-2110 JIRA description, the slave uses the timeout to force a 
> leader-detection & re-registration, which is only loosely related to what the 
> master is doing with the timeout (declaring the slave shutdown). Even without 
> my changes, the Timers on master/slave could be out of sync and the two 
> timeout events aren't guaranteed to happen in a particular order. If the 
> master were to timeout first, and the slave did eventually detect a new 
> leader, the slave would still be told to shutdown and its re-registration 
> would fail. If the slave timed out first, it would try to re-register with 
> the currently verified leader (could be the same), and the master's 
> SlaveObserver would keep trying to ping the slave and potentially eventually 
> shut it down. That's why I would argue that this patch is fine on its own, 
> and a subsequent patch (same JIRA?) sh
 ould clean up the slave's ping timeout behavior.
>     
>     `src/master/constants.hpp` has the following note:
>     ```
>     // NOTE: The slave uses these PING constants to determine when
>     // the master has stopped sending pings. If these are made
>     // configurable, then we'll need to rely on upper/lower bounds
>     // to ensure that the slave is not unnecessarily triggering
>     // re-registrations.
>     ```
>     If the master flags are configurable, there's currently no way for the 
> slave to know what value the current leading master is using, especially if 
> the slave is disconnected at the time. After reading through the comments in 
> MESOS-1529 (which added the slave-side ping timeout), I'm thinking that maybe 
> the master->slave registered (and PING?) messages could embed the current 
> master's actual ping timeout. Then the pingTimers started from 
> Slave::registered() and Slave::ping() could use that value (same as master's) 
> to prevent unnecessary reregistration due to a timeout mismatch.
>     
>     Thoughts, suggestions?
> 
> Ben Mahler wrote:
>     > I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was.
>     
>     Hm.. I think you misunderstood the intent of the slave using this timeout 
> (it's not trying to prevent the things you discussed, please clarify if you 
> think it's causing some issues, I didn't quite understand what you're 
> pointing out).
>     
>     These timers are in place to guarantee that if the master removes a slave 
> (stops health checking it), then the slave will eventually terminate. Without 
> this timeout, it's possible for the master -> slave removal messages to drop 
> while the slave -> master socket remains intact. At this point, (without the 
> timeout) the slave would remain running potentially forever (unless a message 
> (e.g. status update) is sent to the master for the master to detect the 
> zombied slave).
>     
>     The slave also uses the pings to prevent a slave from getting "stuck" in 
> a situation where the master considers the slave disconnected, but the slave 
> considers itself connected. This can happen via a one-way socket closure from 
> master -> slave. Note that the health checking in the master continues while 
> the slave is disconnected.
>     
>     The comments here should help:
>     
>     ```
>     void Slave::ping(const UPID& from, bool connected)
>     {
>       VLOG(1) << "Received ping from " << from;
>     
>       if (!connected && state == RUNNING) {
>         // This could happen if there is a one way partition between
>         // the master and slave, causing the master to get an exited
>         // event and marking the slave disconnected but the slave
>         // thinking it is still connected. Force a re-registration with
>         // the master to reconcile.
>         LOG(INFO) << "Master marked the slave as disconnected but the slave"
>                   << " considers itself registered! Forcing re-registration.";
>         detection.discard();
>       }
>     
>       // If we don't get a ping from the master, trigger a
>       // re-registration. This can occur when the master no
>       // longer considers the slave to be registered, so it is
>       // essential for the slave to attempt a re-registration
>       // when this occurs.
>       Clock::cancel(pingTimer);
>     
>       pingTimer = delay(
>           MASTER_PING_TIMEOUT(),
>           self(),
>           &Slave::pingTimeout,
>           detection);
>     
>       send(from, PongSlaveMessage());
>     }
>     ```
>     
>     Based on all of this, it's really important to understand the 
> implications of the slave under and over estimating the timeout on the master 
> side. For example, if the slave under-estimates and always triggers 
> re-registration, that would be problematic on large clusters (40k slaves / 75 
> seconds = 500+ re-registrations / second in the steady state!). That's why a 
> TODO mentioned using upper/lower bounds for this.

Thank you for reviewing/discussing. I hadn't thought much before about the 
one-way socket closure scenario, but it doesn't drastically change my 
perspective. As you mention, it's important to consider the implications of the 
slave misestimating the timeout, especially with a configurable timeout where a 
master and slave could be at different versions/configurations during a rolling 
upgrade. There must be one source of truth for this timeout, rather than 
allowing the slave to "estimate" or have its own timeout configuration. So I 
still propose that the master should pass `master_ping_timeout` to the slave 
via SlaveRe[re]gisteredMessage, so that any slave that has ever considered 
itself registered/RUNNING knows the master's actual value and will wait at 
least as long as the master, no matter if/when/how the socket closes. We could 
also pass the timeout in the PingSlaveMessage if we're paranoid about the 
master changing its ping timeout value without triggering a reregistration. If 
you 
 aren't opposed, I can put together a patch and some tests to demonstrate.

Even with the master communicating its timeout so that both sides use the same 
value, at the point of a two-way network split whichever side last received a 
Ping/Pong will start its timer later and consequently time out later due to the 
non-zero communication latency. In the event of a one-way socket closure where 
the slave can still reach the master, a successful Pong will always follow a 
successful Ping, so the slave will time out first.
If the master times out first, it will send a ShutdownSlaveMessage, remove the 
slave, and terminate the SlaveObserver, never trying to contact the slave 
again. The slave may continue trying to send status updates, etc. in the 
meantime (which might go through in a one-way disconnect), but the slave will 
eventually timeout, disconnect itself (so it stops sending messages), and 
consult the MasterDetector for a leader to reregister with. If the master 
cannot reach the slave, it is likely that the ZK quorum cannot reach it either, 
but if the slave can reach the ZK quorum, or is configured with a static 
master, it will indeed try to re-authenticate/register, using a backoff for 
registration, but not for authentication (TODO).
If the slave times out first, it will unnecessarily try to 
reregister/authenticate even though the master thinks it is still registered. 
If the master continues to send messages to the slave, they will all be dropped 
with a warning logged, except for Ping messages which will try to reply with a 
Pong. If the reregister/authenticate/pong messages from the slave cannot reach 
the master, the master will eventually remove the slave. If the slave is able 
to reregister before the master removes it (i.e. brief network connectivity 
issue), all will go back to normal, but the whole headache could have been 
avoided if the slave had timed out after the master.
To try to enforce ordering (master always times out first), we could have the 
master timeout after `slave_ping_timeout*max_slave_ping_timeouts` and have the 
slave timeout after `slave_ping_timeout*(max_slave_ping_timeouts+1)`, passing 
these values from the master to slave as previously discussed.
Thoughts?

- Adam

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67113
-----------------------------------------------------------

On Jan. 8, 2015, 12:44 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Jan. 8, 2015, 12:44 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp f5c8d2a 
>   src/master/master.cpp d6651e2 
>   src/slave/constants.hpp fd1c1ab 
>   src/slave/constants.cpp 2a99b11 
>   src/slave/slave.cpp 50b5781 
>   src/tests/fault_tolerance_tests.cpp 5763486 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp cd4a398 
>   src/tests/slave_tests.cpp c50cbc7 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Reply via email to