[ 
https://issues.apache.org/jira/browse/MESOS-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004993#comment-14004993
 ] 

Benjamin Mahler commented on MESOS-1326:
----------------------------------------

On the other hand, there are indeed some cases where the slave is able to 
restart within 10 seconds:

{noformat}
W0521 10:02:13.549584 54783 slave.cpp:425] Ignoring shutdown message from 
[email protected]:5050 because it is not from the registered master: None
2014-05-21 10:02:14,973:54764(0x7fb89cfc6940):ZOO_ERROR@getaddrs@599: 
getaddrinfo: Invalid argument

F0521 10:02:14.973850 54772 zookeeper.cpp:74] Failed to create ZooKeeper, 
zookeeper_init: Invalid argument [22]
*** Check failure stack trace: ***
    @     0x7fb8a48125fd  google::LogMessage::Fail()
    @     0x7fb8a4814444  google::LogMessage::SendToLog()
    @     0x7fb8a48121ec  google::LogMessage::Flush()
    @     0x7fb8a48123f9  google::LogMessage::~LogMessage()
    @     0x7fb8a4813372  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7fb8a455d561  ZooKeeper::ZooKeeper()
    @     0x7fb8a4567f38  zookeeper::GroupProcess::expired()
    @     0x7fb8a4568198  zookeeper::GroupProcess::timedout()
    @     0x7fb8a47481c2  process::ProcessManager::resume()
    @     0x7fb8a47484bc  process::schedule()
    @     0x7fb8a3cbc83d  start_thread
    @     0x7fb8a2a2426d  clone
/usr/local/bin/mesos-slave.sh: line 115: 54764 Aborted                 (core 
dumped) $debug /usr/local/sbin/mesos-slave --port=5051 
--resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" 
--master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 134
I0521 10:02:22.104622 44885 logging.cpp:106] Logging INFO level started!
I0521 10:02:22.105105 44885 main.cpp:126] Build: 2014-04-24 19:52:05 by 
mockbuild
I0521 10:02:22.105131 44885 main.cpp:128] Version: 0.19.0-tw3
W0521 10:02:22.105160 44885 containerizer.cpp:169] The 'cgroups' isolation flag 
is deprecated, please update your flags to 
'--isolation=cgroups/cpu,cgroups/mem'.
I0521 10:02:22.105423 44885 containerizer.cpp:177] Using isolation: 
cgroups/cpu,cgroups/mem
I0521 10:02:22.128782 44885 cgroups_launcher.cpp:58] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the cgroups launcher
2014-05-21 10:02:22,129:44885(0x7f6d93c62940):ZOO_INFO@log_env@712: Client 
environment:zookeeper.version=zookeeper C client 3.4.5
I0521 10:02:22.129151 44885 main.cpp:149] Starting Mesos slave
{noformat}

> Retry policy for zookeeper_init failures
> ----------------------------------------
>
>                 Key: MESOS-1326
>                 URL: https://issues.apache.org/jira/browse/MESOS-1326
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.19.0
>            Reporter: Jie Yu
>              Labels: reliability
>
> Currently, we fatal when we have a zookeeper_init failure. Sometimes, this is 
> annoying because during a DNS failover, we may experience this a lot and we 
> don't necessarily need to fatal in those cases.
> I am wondering whether we can retry on zookeeper_init failures?
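
For illustration only, the retry idea raised in the description could look roughly like the sketch below. It is not Mesos code: `retryWithBackoff`, its parameters, and the doubling backoff are all hypothetical choices made for this example, and the `init` callable stands in for whatever wraps the real initialization step.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <thread>

// Hypothetical helper (not part of Mesos or the ZooKeeper client):
// calls `init` until it yields a value or `maxAttempts` is reached,
// sleeping between attempts with a delay that doubles up to `maxDelay`.
template <typename T>
std::optional<T> retryWithBackoff(
    const std::function<std::optional<T>()>& init,
    int maxAttempts,
    std::chrono::milliseconds initialDelay,
    std::chrono::milliseconds maxDelay)
{
  assert(maxAttempts >= 1);

  std::chrono::milliseconds delay = initialDelay;

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    std::optional<T> result = init();
    if (result.has_value()) {
      return result;  // Initialization succeeded.
    }

    if (attempt == maxAttempts) {
      break;  // Out of attempts; do not sleep again.
    }

    // Wait before the next attempt, then double the delay (capped).
    std::this_thread::sleep_for(delay);
    delay = std::min(delay * 2, maxDelay);
  }

  // Every attempt failed; the caller decides whether this is fatal.
  return std::nullopt;
}
```

In the setting of this issue, `init` would presumably wrap the `zookeeper_init` call and return `std::nullopt` when it hands back a null handle, so that a transient DNS failure is retried and the process only aborts once the attempts are exhausted. The attempt count and delays would need to be chosen against the observed restart time.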



--
This message was sent by Atlassian JIRA
(v6.2#6252)
