Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Niklas Nielsen Tue, 17 Feb 2015 16:59:32 -0800


> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote:
> > src/master/flags.hpp, lines 361-373
> > <https://reviews.apache.org/r/29507/diff/1/?file=804645#file804645line361>
> >
> >     Do you mind changing the names now that they're merely defaults?
> >     
> >     s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/
> >     s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/
> >     
> >     Once you do this, it will become apparent there is more needed in this 
> > change. In particular, all the slave logic that depends on 
> > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require 
> > some more thought.
> 
> Adam B wrote:
>     Renamed to make the DEFAULT nature explicit.
>     
>     I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was. As I alluded to in the 
> MESOS-2110 JIRA description, the slave uses the timeout to force a 
> leader-detection & re-registration, which is only loosely related to what the 
> master is doing with the timeout (declaring the slave shutdown). Even without 
> my changes, the Timers on master/slave could be out of sync and the two 
> timeout events aren't guaranteed to happen in a particular order. If the 
> master were to timeout first, and the slave did eventually detect a new 
> leader, the slave would still be told to shutdown and its re-registration 
> would fail. If the slave timed out first, it would try to re-register with 
> the currently verified leader (could be the same), and the master's 
> SlaveObserver would keep trying to ping the slave and potentially eventually 
> shut it down. That's why I would argue that this patch is fine on its own, 
> and a subsequent patch (same JIRA?) sh
 ould clean up the slave's ping timeout behavior.
>     
>     `src/master/constants.hpp` has the following note:
>     ```
>     // NOTE: The slave uses these PING constants to determine when
>     // the master has stopped sending pings. If these are made
>     // configurable, then we'll need to rely on upper/lower bounds
>     // to ensure that the slave is not unnecessarily triggering
>     // re-registrations.
>     ```
>     If the master flags are configurable, there's currently no way for the 
> slave to know what value the current leading master is using, especially if 
> the slave is disconnected at the time. After reading through the comments in 
> MESOS-1529 (which added the slave-side ping timeout), I'm thinking that maybe 
> the master->slave registered (and PING?) messages could embed the current 
> master's actual ping timeout. Then the pingTimers started from 
> Slave::registered() and Slave::ping() could use that value (same as master's) 
> to prevent unnecessary reregistration due to a timeout mismatch.
>     
>     Thoughts, suggestions?
> 
> Ben Mahler wrote:
>     > I would argue that with this change as is, the slave side of the ping 
> timeout isn't much more broken than it already was.
>     
>     Hm.. I think you misunderstood the intent of the slave using this timeout 
> (it's not trying to prevent the things you discussed, please clarify if you 
> think it's causing some issues, I didn't quite understand what you're 
> pointing out).
>     
>     These timers are in place to guarantee that if the master removes a slave 
> (stops health checking it), then the slave will eventually terminate. Without 
> this timeout, it's possible for the master -> slave removal messages to drop 
> while the slave -> master socket remains intact. At this point, (without the 
> timeout) the slave would remain running potentially forever (unless a message 
> (e.g. status update) is sent to the master for the master to detect the 
> zombied slave).
>     
>     The slave also uses the pings to prevent a slave from getting "stuck" in 
> a situation where the master considers the slave disconnected, but the slave 
> considers itself connected. This can happen via a one-way socket closure from 
> master -> slave. Note that the health checking in the master continues while 
> the slave is disconnected.
>     
>     The comments here should help:
>     
>     ```
>     void Slave::ping(const UPID& from, bool connected)
>     {
>       VLOG(1) << "Received ping from " << from;
>     
>       if (!connected && state == RUNNING) {
>         // This could happen if there is a one way partition between
>         // the master and slave, causing the master to get an exited
>         // event and marking the slave disconnected but the slave
>         // thinking it is still connected. Force a re-registration with
>         // the master to reconcile.
>         LOG(INFO) << "Master marked the slave as disconnected but the slave"
>                   << " considers itself registered! Forcing re-registration.";
>         detection.discard();
>       }
>     
>       // If we don't get a ping from the master, trigger a
>       // re-registration. This can occur when the master no
>       // longer considers the slave to be registered, so it is
>       // essential for the slave to attempt a re-registration
>       // when this occurs.
>       Clock::cancel(pingTimer);
>     
>       pingTimer = delay(
>           MASTER_PING_TIMEOUT(),
>           self(),
>           &Slave::pingTimeout,
>           detection);
>     
>       send(from, PongSlaveMessage());
>     }
>     ```
>     
>     Based on all of this, it's really important to understand the 
> implications of the slave under and over estimating the timeout on the master 
> side. For example, if the slave under-estimates and always triggers 
> re-registration, that would be problematic on large clusters (40k slaves / 75 
> seconds = 500+ re-registrations / second in the steady state!). That's why a 
> TODO mentioned using upper/lower bounds for this.
> 
> Adam B wrote:
>     Thank you for reviewing/discussing. I hadn't thought much before about 
> the one-way socket closure scenario, but it doesn't drastically change my 
> perspective. As you mention, it's important to consider the implications of 
> the slave misestimating the timeout, especially with a configurable timeout 
> where a master and slave could be at different versions/configurations during 
> a rolling upgrade. There must be one source of truth for this timeout, rather 
> than allowing the slave to "estimate" or have its own timeout configuration. 
> So I still propose that the master should pass `master_ping_timeout` to the 
> slave via SlaveRe[re]gisteredMessage, so that any slave that has ever 
> considered itself registered/RUNNING knows the master's actual value and will 
> wait at least as long as the master, no matter if/when/how the socket closes. 
> We could also pass the timeout in the PingSlaveMessage if we're paranoid 
> about the master changing its ping timeout value without triggering a 
> reregistration. I
 f you aren't opposed, I can put together a patch and some tests to demonstrate.
>     
>     Even with the master communicating its timeout so that both sides use the 
> same value, at the point of a two-way network split whichever side last 
> received a Ping/Pong will start its timer later and consequently time out 
> later due to the non-zero communication latency. In the event of a one-way 
> socket closure where the slave can still reach the master, a successful Pong 
> will always follow a successful Ping, so the slave will time out first.
>     If the master times out first, it will send a ShutdownSlaveMessage, 
> remove the slave, and terminate the SlaveObserver, never trying to contact 
> the slave again. The slave may continue trying to send status updates, etc. 
> in the meantime (which might go through in a one-way disconnect), but the 
> slave will eventually timeout, disconnect itself (so it stops sending 
> messages), and consult the MasterDetector for a leader to reregister with. If 
> the master cannot reach the slave, it is likely that the ZK quorum cannot 
> reach it either, but if the slave can reach the ZK quorum, or is configured 
> with a static master, it will indeed try to re-authenticate/register, using a 
> backoff for registration, but not for authentication (TODO).
>     If the slave times out first, it will unnecessarily try to 
> reregister/authenticate even though the master thinks it is still registered. 
> If the master continues to send messages to the slave, they will all be 
> dropped with a warning logged, except for Ping messages which will try to 
> reply with a Pong. If the reregister/authenticate/pong messages from the 
> slave cannot reach the master, the master will eventually remove the slave. 
> If the slave is able to reregister before the master removes it (i.e. brief 
> network connectivity issue), all will go back to normal, but the whole 
> headache could have been avoided if the slave had timed out after the master.
>     To try to enforce ordering (master always times out first), we could have 
> the master timeout after `slave_ping_timeout*max_slave_ping_timeouts` and 
> have the slave timeout after 
> `slave_ping_timeout*(max_slave_ping_timeouts+1)`, passing these values from 
> the master to slave as previously discussed.
>     Thoughts?


Ping - did you guys reach to an agreement?


- Niklas


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review67113
-----------------------------------------------------------


On Jan. 8, 2015, 12:44 a.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated Jan. 8, 2015, 12:44 a.m.)
> 
> 
> Review request for mesos and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to replace the existing (still default) 
> SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> 
> Diffs
> -----
> 
>   src/master/constants.hpp c386eab 
>   src/master/constants.cpp 9ee17e9 
>   src/master/flags.hpp f5c8d2a 
>   src/master/master.cpp d6651e2 
>   src/slave/constants.hpp fd1c1ab 
>   src/slave/constants.cpp 2a99b11 
>   src/slave/slave.cpp 50b5781 
>   src/tests/fault_tolerance_tests.cpp 5763486 
>   src/tests/partition_tests.cpp fea7801 
>   src/tests/slave_recovery_tests.cpp cd4a398 
>   src/tests/slave_tests.cpp c50cbc7 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different 
> --slave_ping_timeout and --max_slave_ping_timeouts.
> Updated unit tests to use shorter non-default values for ping timeouts.
> 
> 
> Thanks,
> 
> Adam B
> 
>

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Reply via email to