Miguel Bernadin created MESOS-7920:
--------------------------------------
Summary: Mesos Agents Crash If Loss Connection to ZK
Key: MESOS-7920
URL: https://issues.apache.org/jira/browse/MESOS-7920
Project: Mesos
Issue Type: Bug
Components: agent
Affects Versions: 1.3.1
Reporter: Miguel Bernadin
Assignee: Vinod Kone
Mesos agents die when they lose access to the ZooKeeper quorum. When this
happens, the {{dcos-mesos-slave.service}} main process exits with code=killed,
status=6/ABRT.
_*Mesos Agents Exiting with Loss of ZK Connectivity*_
{code:java}
mesos-slave[12971]: 2017-08-09
20:12:44,698:12971(0x7fd2161d3700):ZOO_INFO@zookeeper_init@786: Initiating
client connection, host=leader.mesos:2181 sessionTimeout=10000
watcher=0x7fd22188d250 sessionId=0 sessionPasswd=<null> con
mesos-slave[12971]: 2017-08-09
20:12:44,713:12971(0x7fd2161d3700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such
file or directory
mesos-slave[12971]: F0809 20:12:44.713604 12988 zookeeper.cpp:132] Failed to
create ZooKeeper, zookeeper_init: No such file or directory [2]
mesos-slave[12971]: *** Check failure stack trace: ***
mesos-slave[12971]: @ 0x7fd22075a9fd google::LogMessage::Fail()
mesos-slave[12971]: @ 0x7fd22075c89d google::LogMessage::SendToLog()
mesos-slave[12971]: @ 0x7fd22075a5ec google::LogMessage::Flush()
mesos-slave[12971]: @ 0x7fd22075a7f9 google::LogMessage::~LogMessage()
mesos-slave[12971]: @ 0x7fd22075b76e google::ErrnoLogMessage::~ErrnoLogMessage()
mesos-slave[12971]: @ 0x7fd22188daf3 ZooKeeperProcess::initialize()
mesos-slave[12971]: @ 0x7fd221c424c1 process::ProcessManager::resume()
mesos-slave[12971]: @ 0x7fd221c42777
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
mesos-slave[12971]: @ 0x7fd2201fad73 (unknown)
mesos-slave[12971]: @ 0x7fd21fcfb52c (unknown)
mesos-slave[12971]: @ 0x7fd21fa391dd (unknown)
systemd[1]: dcos-mesos-slave.service: Main process exited, code=killed,
status=6/ABRT
systemd[1]: dcos-mesos-slave.service: Unit entered failed state.
systemd[1]: dcos-mesos-slave.service: Failed with result 'signal'.
systemd[1]: dcos-mesos-slave.service: Service hold-off time over, scheduling
restart.
{code}
*NEXT STEP*
Determine whether we can change how Mesos responds to losing access to ZK.
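One possible direction, sketched here only as an illustration: in the log above, the agent aborts as soon as {{zookeeper_init}} fails (the fatal check in {{ZooKeeperProcess::initialize()}}). A bounded retry with exponential backoff around the connection attempt might let the agent survive a short DNS or ZK outage. The code below is not Mesos code and does not use the ZooKeeper client API; {{Connection}}, {{connectWithBackoff}}, and all of its parameters are hypothetical names chosen for this sketch, and it assumes a C++17 compiler.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <optional>
#include <thread>

// Hypothetical stand-in for a ZooKeeper session handle (not a real Mesos type).
struct Connection {
  int id;
};

// Calls `connect` until it succeeds or `maxAttempts` attempts have been made.
// The delay between attempts doubles after each failure, capped at `maxDelay`.
// Returns std::nullopt if every attempt fails, so the caller decides what to
// do next instead of the process aborting on the first failure.
std::optional<Connection> connectWithBackoff(
    const std::function<std::optional<Connection>()>& connect,
    int maxAttempts,
    std::chrono::milliseconds initialDelay,
    std::chrono::milliseconds maxDelay)
{
  std::chrono::milliseconds delay = initialDelay;

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    std::optional<Connection> connection = connect();
    if (connection.has_value()) {
      return connection;
    }

    std::cerr << "Connection attempt " << attempt << " of " << maxAttempts
              << " failed" << std::endl;

    // Sleep only if another attempt will follow.
    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(delay);
      delay = std::min(delay * 2, maxDelay);
    }
  }

  return std::nullopt;
}
```

A caller would pass a function that wraps the real connection attempt and returns an empty optional on failure. Whether the agent should eventually give up and exit, or keep retrying indefinitely, is the open design question for this ticket.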