[ 
https://issues.apache.org/jira/browse/MESOS-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526221#comment-14526221
 ] 

Benjamin Mahler commented on MESOS-2679:
----------------------------------------

Yes, the counter is cleared once a pong message is received. We currently don't 
have logging for the timeouts of individual pings, but we log the time at which 
5 consecutive timeouts occur (master.cpp:237 above at 15:12:00) and the time at 
which the slave receives the shutdown message (slave.cpp:571 above at 
15:12:12). The gap between the two is normally in the milliseconds (network 
delay), but in your case it took 12 seconds. This is why I suspect network 
issues.
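
To make the reasoning above concrete, here is a minimal sketch in Python. It is 
not the Mesos implementation: the class PingCounter, its methods, and the 
helper gap_seconds are hypothetical names for illustration. Only the threshold 
of 5 consecutive timeouts and the two log timestamps come from this thread.

```python
from datetime import datetime


class PingCounter:
    """Hypothetical model of the consecutive-timeout counter (not Mesos code)."""

    def __init__(self, max_timeouts=5):
        # 5 is the number of consecutive timeouts mentioned in this comment.
        self.max_timeouts = max_timeouts
        self.timeouts = 0

    def pong(self):
        # A received pong clears the counter.
        self.timeouts = 0

    def timeout(self):
        # Count one unanswered ping; True means the threshold was reached.
        self.timeouts += 1
        return self.timeouts >= self.max_timeouts


def gap_seconds(start, end):
    """Seconds between two same-day HH:MM:SS.ffffff log timestamps."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Four timeouts, one pong (which resets the count), then five more timeouts.
events = ["timeout"] * 4 + ["pong"] + ["timeout"] * 5
counter = PingCounter()
shutdown_at = None
for i, event in enumerate(events):
    if event == "pong":
        counter.pong()
    elif counter.timeout():
        shutdown_at = i  # index of the event that triggers the shutdown
        break

print(shutdown_at)

# Master log (master.cpp:237) versus slave log (slave.cpp:571) from the issue.
print(gap_seconds("15:12:00.615777", "15:12:12.737057"))
```

In this sketch the four early timeouts do not count toward the shutdown 
because the pong resets the counter, and the two timestamps from the logs 
differ by roughly 12.1 seconds.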

> Slave asked to shut down by master because 'health check timed out'
> -------------------------------------------------------------------
>
>                 Key: MESOS-2679
>                 URL: https://issues.apache.org/jira/browse/MESOS-2679
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>    Affects Versions: 0.22.1
>            Reporter: Littlestar
>
> I run Spark 1.3.1 on Mesos 0.22.1 rc6 (linux64), and some Mesos slave nodes 
> go offline.
> slave node logs:
> I0430 15:12:12.737057 32354 slave.cpp:571] Slave asked to shut down by 
> [email protected]:5050 because 'health check timed out'
> master node logs:
> I0430 15:12:00.615777 19759 master.cpp:237] Shutting down slave 
> 20150430-141442-1214949568-5050-19747-S2 due to health check timeout
> W0430 15:12:00.616083 19751 master.cpp:3417] Shutting down slave 
> 20150430-141442-1214949568-5050-19747-S2 at slave(1)@192.168.1.15:5051 
> (hpblade05) with message 'health check timed out'
> why master-slave offline and not restart itself? 
> Any configurations to increase this timeout interval?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)