[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325540#comment-14325540
 ] 

Daniel Hall commented on MESOS-2186:
------------------------------------

The ZooKeeper cluster is still available. We run three ZooKeeper servers. If a 
single one is down, the cluster can still operate because the quorum only 
requires two machines to be up. However, if a single server does not resolve in 
DNS, this bug is triggered and Mesos is unable to connect to the cluster 
even though it has quorum.
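
To illustrate the behaviour we would expect instead, here is a minimal sketch 
in Python. It is not how Mesos or the ZooKeeper C client work today. The 
function names and the hostnames in the usage note are hypothetical, and it 
assumes every entry has the form host:port. It separates the entries that 
resolve from those that do not, and checks whether the resolvable ones still 
form a majority of the ensemble.

```python
import socket


def split_resolvable(hosts, resolve=socket.getaddrinfo):
    """Partition 'host:port' entries into (resolvable, unresolvable).

    `resolve` is injectable so the logic can be exercised without real DNS.
    """
    ok, failed = [], []
    for entry in hosts:
        # Assumes 'host:port'; split on the last colon.
        host, _, port = entry.rpartition(":")
        try:
            resolve(host, int(port))
            ok.append(entry)
        except OSError:
            # socket.gaierror (name resolution failure) is a subclass of OSError.
            failed.append(entry)
    return ok, failed


def has_quorum(resolved_count, ensemble_size):
    """A ZooKeeper ensemble needs a strict majority of its servers."""
    return resolved_count >= ensemble_size // 2 + 1
```

With three configured servers of which one does not resolve, 
split_resolvable returns two usable entries and has_quorum(2, 3) is true, so a 
client following this approach could log the failure and continue with the 
remaining two servers rather than abort.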

Absolutely, upstart should give up if it has tried restarting the process many 
times in a short period. If it didn't, it could cause a thundering herd 
problem. We would rather it stop trying and alert a human operator.
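
The give-up rule described above can be sketched as a sliding-window counter. 
This is only an illustration in Python of the general idea behind upstart's 
"respawn limit" stanza, not upstart's actual implementation. The class name 
and parameters are hypothetical.

```python
import time
from collections import deque


class RespawnLimiter:
    """Allow at most `max_restarts` restarts within any `window_seconds` span."""

    def __init__(self, max_restarts, window_seconds, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._stamps = deque()

    def allow_restart(self):
        now = self.clock()
        # Forget restarts that have fallen out of the window.
        while self._stamps and now - self._stamps[0] > self.window_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_restarts:
            # Too many recent restarts: stop and leave it to an operator.
            return False
        self._stamps.append(now)
        return True
```

A supervisor would call allow_restart() before each respawn and raise an alert 
the first time it returns False.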

> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
>                 Key: MESOS-2186
>                 URL: https://issues.apache.org/jira/browse/MESOS-2186
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>            Reporter: Daniel Hall
>            Priority: Critical
>              Labels: mesosphere
>
> When starting Mesos, if one of the configured ZooKeeper servers does not 
> resolve in DNS, Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our ZooKeeper hosts in Google Compute (which 
> bases its DNS records on the machines that are running).
> Here is a log from a failed startup (hostnames and IP addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 
> 28643 contender.cpp:131] Joining the ZK group
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 
> 28640 master.cpp:1202] Successfully attached file 
> '/var/log/mesos/mesos-master.INFO'
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f55fa21f  
> process::schedule()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e498079d1  
> (unknown)
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e494e89dd  
> (unknown)
> Dec  9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in 
> '/usr/local/sbin/mesos-master'
> Dec  9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed 
> by ABRT signal
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
