> On Jan. 7, 2015, 2:50 p.m., Ben Mahler wrote: > > src/master/flags.hpp, lines 361-373 > > <https://reviews.apache.org/r/29507/diff/1/?file=804645#file804645line361> > > > > Do you mind changing the names now that they're merely defaults? > > > > s/SLAVE_PING_TIMEOUT/DEFAULT_SLAVE_PING_TIMEOUT/ > > s/MAX_SLAVE_PING_TIMEOUTS/DEFAULT_MAX_SLAVE_PING_TIMEOUTS/ > > > > Once you do this, it will become apparent there is more needed in this > > change. In particular, all the slave logic that depends on > > MASTER_PING_TIMEOUT() is broken now! Making it configurable will require > > some more thought. > > Adam B wrote: > Renamed to make the DEFAULT nature explicit. > > I would argue that with this change as is, the slave side of the ping > timeout isn't much more broken than it already was. As I alluded to in the > MESOS-2110 JIRA description, the slave uses the timeout to force a > leader-detection & re-registration, which is only loosely related to what the > master is doing with the timeout (declaring the slave shutdown). Even without > my changes, the Timers on master/slave could be out of sync and the two > timeout events aren't guaranteed to happen in a particular order. If the > master were to timeout first, and the slave did eventually detect a new > leader, the slave would still be told to shutdown and its re-registration > would fail. If the slave timed out first, it would try to re-register with > the currently verified leader (could be the same), and the master's > SlaveObserver would keep trying to ping the slave and potentially eventually > shut it down. That's why I would argue that this patch is fine on its own, > and a subsequent patch (same JIRA?) sh ould clean up the slave's ping timeout behavior. > > `src/master/constants.hpp` has the following note: > ``` > // NOTE: The slave uses these PING constants to determine when > // the master has stopped sending pings. If these are made > // configurable, then we'll need to rely on upper/lower bounds > // to ensure that the slave is not unnecessarily triggering > // re-registrations. > ``` > If the master flags are configurable, there's currently no way for the > slave to know what value the current leading master is using, especially if > the slave is disconnected at the time. After reading through the comments in > MESOS-1529 (which added the slave-side ping timeout), I'm thinking that maybe > the master->slave registered (and PING?) messages could embed the current > master's actual ping timeout. Then the pingTimers started from > Slave::registered() and Slave::ping() could use that value (same as master's) > to prevent unnecessary reregistration due to a timeout mismatch. > > Thoughts, suggestions? > > Ben Mahler wrote: > > I would argue that with this change as is, the slave side of the ping > timeout isn't much more broken than it already was. > > Hm.. I think you misunderstood the intent of the slave using this timeout > (it's not trying to prevent the things you discussed, please clarify if you > think it's causing some issues, I didn't quite understand what you're > pointing out). > > These timers are in place to guarantee that if the master removes a slave > (stops health checking it), then the slave will eventually terminate. Without > this timeout, it's possible for the master -> slave removal messages to drop > while the slave -> master socket remains intact. At this point, (without the > timeout) the slave would remain running potentially forever (unless a message > (e.g. status update) is sent to the master for the master to detect the > zombied slave). > > The slave also uses the pings to prevent a slave from getting "stuck" in > a situation where the master considers the slave disconnected, but the slave > considers itself connected. This can happen via a one-way socket closure from > master -> slave. Note that the health checking in the master continues while > the slave is disconnected. > > The comments here should help: > > ``` > void Slave::ping(const UPID& from, bool connected) > { > VLOG(1) << "Received ping from " << from; > > if (!connected && state == RUNNING) { > // This could happen if there is a one way partition between > // the master and slave, causing the master to get an exited > // event and marking the slave disconnected but the slave > // thinking it is still connected. Force a re-registration with > // the master to reconcile. > LOG(INFO) << "Master marked the slave as disconnected but the slave" > << " considers itself registered! Forcing re-registration."; > detection.discard(); > } > > // If we don't get a ping from the master, trigger a > // re-registration. This can occur when the master no > // longer considers the slave to be registered, so it is > // essential for the slave to attempt a re-registration > // when this occurs. > Clock::cancel(pingTimer); > > pingTimer = delay( > MASTER_PING_TIMEOUT(), > self(), > &Slave::pingTimeout, > detection); > > send(from, PongSlaveMessage()); > } > ``` > > Based on all of this, it's really important to understand the > implications of the slave under and over estimating the timeout on the master > side. For example, if the slave under-estimates and always triggers > re-registration, that would be problematic on large clusters (40k slaves / 75 > seconds = 500+ re-registrations / second in the steady state!). That's why a > TODO mentioned using upper/lower bounds for this. > > Adam B wrote: > Thank you for reviewing/discussing. I hadn't thought much before about > the one-way socket closure scenario, but it doesn't drastically change my > perspective. As you mention, it's important to consider the implications of > the slave misestimating the timeout, especially with a configurable timeout > where a master and slave could be at different versions/configurations during > a rolling upgrade. There must be one source of truth for this timeout, rather > than allowing the slave to "estimate" or have its own timeout configuration. > So I still propose that the master should pass `master_ping_timeout` to the > slave via SlaveRe[re]gisteredMessage, so that any slave that has ever > considered itself registered/RUNNING knows the master's actual value and will > wait at least as long as the master, no matter if/when/how the socket closes. > We could also pass the timeout in the PingSlaveMessage if we're paranoid > about the master changing its ping timeout value without triggering a > reregistration. I f you aren't opposed, I can put together a patch and some tests to demonstrate. > > Even with the master communicating its timeout so that both sides use the > same value, at the point of a two-way network split whichever side last > received a Ping/Pong will start its timer later and consequently time out > later due to the non-zero communication latency. In the event of a one-way > socket closure where the slave can still reach the master, a successful Pong > will always follow a successful Ping, so the slave will time out first. > If the master times out first, it will send a ShutdownSlaveMessage, > remove the slave, and terminate the SlaveObserver, never trying to contact > the slave again. The slave may continue trying to send status updates, etc. > in the meantime (which might go through in a one-way disconnect), but the > slave will eventually timeout, disconnect itself (so it stops sending > messages), and consult the MasterDetector for a leader to reregister with. If > the master cannot reach the slave, it is likely that the ZK quorum cannot > reach it either, but if the slave can reach the ZK quorum, or is configured > with a static master, it will indeed try to re-authenticate/register, using a > backoff for registration, but not for authentication (TODO). > If the slave times out first, it will unnecessarily try to > reregister/authenticate even though the master thinks it is still registered. > If the master continues to send messages to the slave, they will all be > dropped with a warning logged, except for Ping messages which will try to > reply with a Pong. If the reregister/authenticate/pong messages from the > slave cannot reach the master, the master will eventually remove the slave. > If the slave is able to reregister before the master removes it (i.e. brief > network connectivity issue), all will go back to normal, but the whole > headache could have been avoided if the slave had timed out after the master. > To try to enforce ordering (master always times out first), we could have > the master timeout after `slave_ping_timeout*max_slave_ping_timeouts` and > have the slave timeout after > `slave_ping_timeout*(max_slave_ping_timeouts+1)`, passing these values from > the master to slave as previously discussed. > Thoughts?
Ping - did you guys reach to an agreement? - Niklas ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29507/#review67113 ----------------------------------------------------------- On Jan. 8, 2015, 12:44 a.m., Adam B wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29507/ > ----------------------------------------------------------- > > (Updated Jan. 8, 2015, 12:44 a.m.) > > > Review request for mesos and Niklas Nielsen. > > > Bugs: MESOS-2110 > https://issues.apache.org/jira/browse/MESOS-2110 > > > Repository: mesos > > > Description > ------- > > Added new --slave_ping_timeout and --max_slave_ping_timeouts flags > to mesos-master to replace the existing (still default) > SLAVE_PING_TIMEOUT (15secs) and MAX_SLAVE_PING_TIMEOUTS (5). > > These can be extended if slaves are expected/allowed to be down for > longer than a minute or two. > > Beware that this affects recovery from network timeouts as well as > actual slave node/process failover. > > > Diffs > ----- > > src/master/constants.hpp c386eab > src/master/constants.cpp 9ee17e9 > src/master/flags.hpp f5c8d2a > src/master/master.cpp d6651e2 > src/slave/constants.hpp fd1c1ab > src/slave/constants.cpp 2a99b11 > src/slave/slave.cpp 50b5781 > src/tests/fault_tolerance_tests.cpp 5763486 > src/tests/partition_tests.cpp fea7801 > src/tests/slave_recovery_tests.cpp cd4a398 > src/tests/slave_tests.cpp c50cbc7 > > Diff: https://reviews.apache.org/r/29507/diff/ > > > Testing > ------- > > Manually tested slave failover/shutdown with master using different > --slave_ping_timeout and --max_slave_ping_timeouts. > Updated unit tests to use shorter non-default values for ping timeouts. > > > Thanks, > > Adam B > >