[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326975#comment-14326975
 ] 

Daniel Hall commented on MESOS-2186:
------------------------------------

Let's just call the setting that makes upstart give up a personal preference. 
In our environment we have plenty of spare capacity, so having a single machine 
give up is okay, and if the whole cluster is having issues then you are going 
to need a human regardless. In any case it's inconsequential to this bug.

In our environment the whole cluster (Mesos and its frameworks, Marathon and 
Chronos) loses the ability to hold elections. This breaks the cluster for both 
the masters and the slaves, although all of the tasks remain running. Once you 
either remove the ZooKeeper host from all the client lists, or provision a new 
server (and hence a DNS record) with the same name, everything starts 
connecting again: all the masters return and elect a leader. However, if you 
look at the task list in Mesos, lots of tasks that are still running on the 
slaves are missing. Marathon also sees all the old tasks as running until the 
next reconciliation. The only way we have found to recover from this situation 
is to restart all the mesos-slave processes, which kills all the tasks on each 
slave. I imagine that this is a separate bug, but since I can reproduce it 
reliably in our staging environment I'll be able to file a better bug report 
once I get some spare time.

We encountered this bug while reprovisioning one of the ZooKeeper hosts in our 
cluster. It seems you can work around the issue by adding an `/etc/hosts` entry 
on the cluster machines for the machine that is about to be removed. The IP 
address you give doesn't even need to be running ZooKeeper; the hostname just 
has to resolve.
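
To find out which hosts would need such an `/etc/hosts` entry before restarting 
anything, a pre-flight check along these lines may help. This is only a sketch: 
it assumes the ZooKeeper list is given in the usual 
`zk://host1:port1,host2:port2/path` form, the function names and the hostnames 
in the usage note are made up for illustration, and it is not part of Mesos.

```python
import socket
from urllib.parse import urlparse

DEFAULT_ZK_PORT = 2181  # ZooKeeper's default client port


def zk_hosts(zk_url):
    """Return (host, port) pairs from a zk://h1:p1,h2:p2/path URL."""
    netloc = urlparse(zk_url).netloc
    # Drop optional "user:password@" credentials in front of the host list.
    netloc = netloc.rsplit("@", 1)[-1]
    pairs = []
    for entry in netloc.split(","):
        host, _, port = entry.partition(":")
        pairs.append((host, int(port) if port else DEFAULT_ZK_PORT))
    return pairs


def unresolvable(zk_url, resolve=socket.getaddrinfo):
    """Return the hosts in zk_url whose names fail to resolve.

    `resolve` is injectable so the check can be exercised without DNS.
    """
    failed = []
    for host, port in zk_hosts(zk_url):
        try:
            resolve(host, port)
        except socket.gaierror:
            # Same class of failure as the getaddrinfo error in the log.
            failed.append(host)
    return failed
```

For example, `unresolvable("zk://zk1.internal:2181,zk2.internal:2181/mesos")` 
returns the names that currently fail to resolve on the machine it is run on. 
Those are the ones that would need a temporary `/etc/hosts` entry.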

> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
>                 Key: MESOS-2186
>                 URL: https://issues.apache.org/jira/browse/MESOS-2186
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>            Reporter: Daniel Hall
>            Priority: Critical
>              Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 
> 28643 contender.cpp:131] Joining the ZK group
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 
> 28640 master.cpp:1202] Successfully attached file 
> '/var/log/mesos/mesos-master.INFO'
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f55fa21f  
> process::schedule()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e498079d1  
> (unknown)
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e494e89dd  
> (unknown)
> Dec  9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in 
> '/usr/local/sbin/mesos-master'
> Dec  9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed 
> by ABRT signal
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
