Miguel Bernadin created MESOS-7920:
--------------------------------------

             Summary: Mesos Agents Crash If Loss Connection to ZK
                 Key: MESOS-7920
                 URL: https://issues.apache.org/jira/browse/MESOS-7920
             Project: Mesos
          Issue Type: Bug
          Components: agent
    Affects Versions: 1.3.1
            Reporter: Miguel Bernadin
            Assignee: Vinod Kone


Mesos agents die when they lose access to the ZooKeeper quorum. When this 
happens, the {{dcos-mesos-slave.service}} main process exits with code=killed, 
status=6/ABRT.

 

_*Mesos Agents Exiting with Loss of ZK Connectivity*_
{code:java}
mesos-slave[12971]: 2017-08-09 20:12:44,698:12971(0x7fd2161d3700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=leader.mesos:2181 sessionTimeout=10000 watcher=0x7fd22188d250 sessionId=0 sessionPasswd=<null> con
mesos-slave[12971]: 2017-08-09 20:12:44,713:12971(0x7fd2161d3700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
mesos-slave[12971]: F0809 20:12:44.713604 12988 zookeeper.cpp:132] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
mesos-slave[12971]: *** Check failure stack trace: ***
mesos-slave[12971]: @ 0x7fd22075a9fd google::LogMessage::Fail()
mesos-slave[12971]: @ 0x7fd22075c89d google::LogMessage::SendToLog()
mesos-slave[12971]: @ 0x7fd22075a5ec google::LogMessage::Flush()
mesos-slave[12971]: @ 0x7fd22075a7f9 google::LogMessage::~LogMessage()
mesos-slave[12971]: @ 0x7fd22075b76e google::ErrnoLogMessage::~ErrnoLogMessage()
mesos-slave[12971]: @ 0x7fd22188daf3 ZooKeeperProcess::initialize()
mesos-slave[12971]: @ 0x7fd221c424c1 process::ProcessManager::resume()
mesos-slave[12971]: @ 0x7fd221c42777 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
mesos-slave[12971]: @ 0x7fd2201fad73 (unknown)
mesos-slave[12971]: @ 0x7fd21fcfb52c (unknown)
mesos-slave[12971]: @ 0x7fd21fa391dd (unknown)
systemd[1]: dcos-mesos-slave.service: Main process exited, code=killed, status=6/ABRT
systemd[1]: dcos-mesos-slave.service: Unit entered failed state.
systemd[1]: dcos-mesos-slave.service: Failed with result 'signal'.
systemd[1]: dcos-mesos-slave.service: Service hold-off time over, scheduling restart.
{code}
 
*NEXT STEP*

Determine whether we can change how Mesos agents respond to losing access to 
ZK, so that a failed connection attempt does not abort the agent process.
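
One possible direction, sketched here purely for discussion, is to retry the client initialization with backoff instead of treating the first failure as fatal. The sketch below is not Mesos code: {{init_with_retry}}, its parameters, and the use of {{OSError}} to stand in for the {{zookeeper_init}} failure seen in the log are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def init_with_retry(
    init: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call init() until it succeeds, backing off exponentially between tries.

    Hypothetical helper: init stands in for the client initialization that
    fails in the log above, and OSError stands in for that failure.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return init()
        except OSError:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the caller, which
                # can then decide whether exiting is still appropriate.
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
    raise ValueError("max_attempts must be at least 1")
```

With this shape, a transient name-resolution failure would delay the agent rather than kill it, while a persistent outage still ends in an error once the attempts are used up. Whether the agent should give up at all, and after how long, is an open design question for this ticket.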





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
