In the latest versions of Mesos, that is handled via heartbeats.
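
To illustrate the idea only (this is not the actual Mesos implementation), here is a minimal sketch of heartbeat-based detection: the client records when each heartbeat arrives and treats the master as lost once it has been silent for too long, whether or not the socket reported an error. The class name `HeartbeatMonitor`, the 15-second interval, and the `max_missed` tolerance are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks heartbeats and reports when too many have been missed (hypothetical helper)."""

    interval_secs: float      # expected gap between heartbeats (assumed value)
    max_missed: int = 3       # hypothetical tolerance before declaring the master lost
    last_seen: float = 0.0    # timestamp of the most recent heartbeat

    def on_heartbeat(self, now: float) -> None:
        # Any heartbeat proves the connection to the master was alive at `now`.
        self.last_seen = now

    def is_disconnected(self, now: float) -> bool:
        # Silence longer than max_missed intervals counts as a lost master,
        # even if the TCP socket never reported EOF or an error.
        return (now - self.last_seen) > self.interval_secs * self.max_missed


monitor = HeartbeatMonitor(interval_secs=15.0)
monitor.on_heartbeat(now=100.0)
at_30s = monitor.is_disconnected(now=130.0)  # 30s of silence, within the 45s tolerance
at_46s = monitor.is_disconnected(now=146.0)  # 46s of silence, beyond the 45s tolerance
```

The clock is passed in explicitly so the logic can be exercised without sleeping; a real client would call `is_disconnected` from a periodic timer and trigger its own reconnect or failover path.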

Thanks,
Vinod

> On Dec 30, 2019, at 4:37 AM, Charles-François Natali <cf.nat...@gmail.com> 
> wrote:
> 
> Thanks.
> 
> That's what I thought. The problem, though, is that the zookeeper detector
> might not notice a failure when the connection to the master breaks. For
> example, a firewall could cause the TCP connection from the framework to
> the master to fail while the zookeeper connections (from master to zk and
> framework to zk) still work. Unlikely but possible, I think. Having the
> driver detect and fail upon EOF/socket error would guard against that.
> 
>> On Thu, 26 Dec 2019, 18:07 Vinod Kone, <vinodk...@apache.org> wrote:
>> 
>> IIRC, the standalone master detector (the detector that's used when using a
>> local IP address of the master and not zk) doesn't re-detect when the master
>> process restarts. It's a limitation of that detector, since it's mainly used
>> for testing purposes and is not recommended for production use. For
>> production, please use the zookeeper master detector (this detector is used
>> when using zookeeper).
>> 
>> On Fri, Dec 20, 2019 at 5:11 AM Charles-François Natali <
>> cf.nat...@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> It seems that the C++ scheduler driver doesn't detect loss of the
>>> connection to the master when not using zookeeper.
>>> 
>>> A simple way to reproduce this is to start a server passing it e.g.
>>> "--ip=127.0.0.1", start the scheduler driver passing it "127.0.0.1:5050",
>>> and then send a SIGKILL to the master. The scheduler logs the following:
>>> 
>>> 
>>> I1220 10:56:11.679347 10635 process.cpp:2928] Resuming
>>> __reaper__(1)@192.168.65.76:34345 at 2019-12-20
>>> 10:56:11.679366144+00:00
>>> I1220 10:56:11.679392 10635 clock.cpp:279] Created a timer for
>>> __reaper__(1)@192.168.65.76:34345 in 100ms in the future (2019-12-20
>>> 10:56:11.779389952+00:00)
>>> I1220 10:56:11.690646 10631 process.cpp:2928] Resuming
>>> scheduler-6a93a8e3-5a8f-4195-bde2-718b5832d317@192.168.65.76:34345 at
>>> 2019-12-20 10:56:11.690665984+00:00
>>> I1220 10:56:11.690775 10632 process.cpp:2928] Resuming
>>> __http__(1)@192.168.65.76:34345 at 2019-12-20 10:56:11.690784000+00:00
>>> I1220 10:56:11.690806 10632 process.cpp:3088] Cleaning up
>>> __http__(1)@192.168.65.76:34345
>>> I1220 10:56:11.690914 10632 process.cpp:2928] Resuming
>>> help@192.168.65.76:34345 at 2019-12-20 10:56:11.690921984+00:00
>>> 
>>> An strace confirms that the process receives EOF when reading from the
>>> socket, but Scheduler::disconnected isn't called.
>>> Is that expected?
>>> 
>>> Or is it assumed that the scheduler relies on zookeeper for detection?
>>> 
>>> Cheers,
>>> 
>>> Charles
>>> 
>> 
