Cosmin Lehene created MESOS-5114:
------------------------------------

             Summary: empty quorum config causes masters to fail replica recovery and fail
                 Key: MESOS-5114
                 URL: https://issues.apache.org/jira/browse/MESOS-5114
             Project: Mesos
          Issue Type: Bug
          Components: master, replicated log
    Affects Versions: 0.28.0
         Environment: CentOS 7.1
            Reporter: Cosmin Lehene
             Fix For: 0.28.1


A missing default for the quorum size generated the following master config, with {{MESOS_QUORUM}} left empty:
{code}
MESOS_WORK_DIR="/var/lib/mesos/master"
MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos"
MESOS_QUORUM=

MESOS_PORT=5050
MESOS_CLUSTER="mesos"
MESOS_LOG_DIR="/var/log/mesos"
MESOS_LOGBUFSECS=1
MESOS_LOGGING_LEVEL="INFO"
{code}

As a result, each elected leader attempted replica recovery.

E.g. {{group.cpp:700] Trying to get '/mesos/log_replicas/0000000012' in ZooKeeper}}

And eventually:
{{master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins}}

Full log from one of the masters:
https://gist.github.com/clehene/09a9ddfe49b92a5deb4c1b421f63479e

All masters and ZooKeeper nodes were reachable over the network.
Once the quorum was configured, the master recovery protocol finished
gracefully.
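
Until the master rejects an empty quorum itself, a pre-flight check on the generated config could catch this before the masters start. The following is only a minimal sketch, not part of Mesos: the helper names {{parse_env_file}} and {{check_quorum}} are hypothetical, and the {{master_count}} argument is an assumption, since the number of masters is not stated above. The majority check follows the usual guidance that the quorum should be greater than half the number of masters.

```python
import re


def parse_env_file(text):
    """Parse KEY=VALUE lines, stripping optional surrounding double quotes."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


def check_quorum(env, master_count=None):
    """Return a list of problems with MESOS_QUORUM (an empty list means OK).

    master_count is a hypothetical, caller-supplied number of masters;
    when it is None, only the "set and numeric" checks are performed.
    """
    raw = env.get("MESOS_QUORUM", "")
    if raw == "":
        return ["MESOS_QUORUM is missing or empty"]
    if not re.fullmatch(r"[0-9]+", raw) or int(raw) < 1:
        return ["MESOS_QUORUM must be a positive integer, got %r" % raw]

    problems = []
    quorum = int(raw)
    if master_count is not None:
        if quorum > master_count:
            problems.append("quorum %d exceeds master count %d" % (quorum, master_count))
        elif quorum * 2 <= master_count:
            problems.append("quorum %d is not a majority of %d masters" % (quorum, master_count))
    return problems


if __name__ == "__main__":
    # The relevant part of the config from this report.
    sample = 'MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos"\nMESOS_QUORUM=\n\nMESOS_PORT=5050\n'
    for problem in check_quorum(parse_env_file(sample)):
        print(problem)
```

Run against the config above, the check reports that {{MESOS_QUORUM}} is missing or empty, so a deployment script could refuse to start the master instead of letting recovery time out.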




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
