In the latest versions of Mesos, that is handled via heartbeats.

Thanks,
Vinod

> On Dec 30, 2019, at 4:37 AM, Charles-François Natali <cf.nat...@gmail.com>
> wrote:
>
> Thanks.
>
> That's what I thought. The problem, though, is that it is probably possible
> that the zookeeper detector doesn't detect the failure while the connection
> to the master fails. One way this could happen would be, for example, because
> of a firewall causing the TCP connection from the framework to the master
> to fail, while the zookeeper connections (from master to zk and framework
> to zk) still work. Unlikely but possible, I think. Having the driver detect
> and fail upon EOF/socket error would guard against that.
>
>
>> On Thu, 26 Dec 2019, 18:07 Vinod Kone, <vinodk...@apache.org> wrote:
>>
>> IIRC, the standalone master detector (the detector that's used when using a
>> local ip address of the master and not zk) doesn't re-detect when the master
>> process restarts. It's a limitation of that detector since it's mainly used
>> for testing purposes and not recommended for production use. For
>> production, please use the zookeeper master detector (this detector is used
>> when using zookeeper).
>>
>> On Fri, Dec 20, 2019 at 5:11 AM Charles-François Natali <
>> cf.nat...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It seems that the C++ scheduler driver doesn't detect loss of the
>>> connection to the master when not using zookeeper.
>>>
>>> A simple way to reproduce this is to start a server passing it e.g.
>>> "--ip=127.0.0.1", start the scheduler driver passing it "127.0.0.1:5050",
>>> and then send a SIGKILL to the master.
>>>
>>> The scheduler logs the following:
>>>
>>>
>>> I1220 10:56:11.679347 10635 process.cpp:2928] Resuming
>>> __reaper__(1)@192.168.65.76:34345 at 2019-12-20
>>> 10:56:11.679366144+00:00
>>> I1220 10:56:11.679392 10635 clock.cpp:279] Created a timer for
>>> __reaper__(1)@192.168.65.76:34345 in 100ms in the future (2019-12-20
>>> 10:56:11.779389952+00:00)
>>> I1220 10:56:11.690646 10631 process.cpp:2928] Resuming
>>> scheduler-6a93a8e3-5a8f-4195-bde2-718b5832d317@192.168.65.76:34345 at
>>> 2019-12-20 10:56:11.690665984+00:00
>>> I1220 10:56:11.690775 10632 process.cpp:2928] Resuming
>>> __http__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.690784000+00:00
>>> I1220 10:56:11.690806 10632 process.cpp:3088] Cleaning up
>>> __http__(1)@192.168.65.76:34345
>>> I1220 10:56:11.690914 10632 process.cpp:2928] Resuming
>>> help@192.168.65.76:34345 at 2019-12-20 10:56:11.690921984+00:00
>>>
>>> An strace confirms that the process receives EOF when reading from the
>>> socket, but Scheduler::disconnected isn't called.
>>> Is that expected?
>>>
>>> Or is it assumed that the scheduler relies on zookeeper for detection?
>>>
>>> Cheers,
>>>
>>> Charles
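
For readers following the thread, here is a rough sketch of the general idea behind heartbeat-based detection mentioned at the top: if no heartbeat has been seen for longer than some timeout, treat the peer as disconnected, regardless of what the detector or the TCP socket reports. This is not Mesos code. HeartbeatMonitor, its methods, and the interval and missed-beat values are all hypothetical names and numbers picked for illustration, and the caller is assumed to pass in the current time.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper (not part of Mesos): records when the last heartbeat
// arrived and reports a disconnection once the timeout has been exceeded.
class HeartbeatMonitor {
public:
  using Clock = std::chrono::steady_clock;

  // The timeout is interval * maxMissed; both values are illustrative.
  HeartbeatMonitor(Clock::duration interval,
                   int maxMissed,
                   std::function<void()> onDisconnected)
    : timeout_(interval * maxMissed),
      onDisconnected_(std::move(onDisconnected)) {}

  // Call whenever a heartbeat (or any other message) arrives from the peer.
  void heartbeat(Clock::time_point now) {
    lastSeen_ = now;
    connected_ = true;
  }

  // Call periodically, e.g. from a timer. Invokes the callback once per
  // disconnection and returns whether the peer is still considered connected.
  bool check(Clock::time_point now) {
    if (connected_ && now - lastSeen_ > timeout_) {
      connected_ = false;
      onDisconnected_();
    }
    return connected_;
  }

private:
  Clock::duration timeout_;
  std::function<void()> onDisconnected_;
  Clock::time_point lastSeen_{};
  bool connected_ = false;
};
```

A scheduler using something like this would call heartbeat() from its message handler and check() from a periodic timer, and the callback would play the role that Scheduler::disconnected plays in the driver. Taking the time as a parameter rather than reading the clock internally is only a choice to keep the sketch easy to exercise in isolation.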