Cosmin Lehene created MESOS-5114:
------------------------------------
Summary: empty quorum config causes masters to fail replica
recovery and fail
Key: MESOS-5114
URL: https://issues.apache.org/jira/browse/MESOS-5114
Project: Mesos
Issue Type: Bug
Components: master, replicated log
Affects Versions: 0.28.0
Environment: CentOS 7.1
Reporter: Cosmin Lehene
Fix For: 0.28.1
Our deployment tooling had no default for the quorum size, so it generated the following master config with an empty {{MESOS_QUORUM}}:
{code}
MESOS_WORK_DIR="/var/lib/mesos/master"
MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos"
MESOS_QUORUM=
MESOS_PORT=5050
MESOS_CLUSTER="mesos"
MESOS_LOG_DIR="/var/log/mesos"
MESOS_LOGBUFSECS=1
MESOS_LOGGING_LEVEL="INFO"
{code}
As a result, each newly elected leader attempted replica recovery, e.g.:
{{group.cpp:700] Trying to get '/mesos/log_replicas/0000000012' in
ZooKeeper}}
Recovery eventually failed with:
{{master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to
perform fetch within 1mins}}
Full log from one of the masters:
https://gist.github.com/clehene/09a9ddfe49b92a5deb4c1b421f63479e
All masters and ZooKeeper nodes were reachable over the network.
Once the quorum was configured, the master recovery protocol finished
gracefully.
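
Until the master rejects an empty value itself, a pre-start check along these lines might catch the misconfiguration. This is only a sketch: {{parse_env}} and {{check_quorum}} are hypothetical helper names, the number of masters is not part of the config above and has to be supplied by the operator, and the majority rule assumes the usual guidance that the quorum should be greater than half the number of masters.

```python
import re


def parse_env(text):
    """Parse simple KEY=VALUE lines; surrounding double quotes are stripped."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


def check_quorum(env, num_masters):
    """Return a list of problems with MESOS_QUORUM; an empty list means OK.

    num_masters is an assumption supplied by the caller; it cannot be
    derived from the config file itself.
    """
    raw = env.get("MESOS_QUORUM", "")
    if raw == "":
        return ["MESOS_QUORUM is missing or empty"]
    if not re.fullmatch(r"[0-9]+", raw) or int(raw) < 1:
        return ["MESOS_QUORUM must be a positive integer, got %r" % raw]
    quorum = int(raw)
    if quorum > num_masters:
        return ["MESOS_QUORUM (%d) exceeds the number of masters (%d)"
                % (quorum, num_masters)]
    if quorum * 2 <= num_masters:
        return ["MESOS_QUORUM (%d) is not a majority of %d masters"
                % (quorum, num_masters)]
    return []
```

For example, running {{check_quorum(parse_env(config_text), 3)}} against the config in this report would flag the empty value, while a config with {{MESOS_QUORUM=2}} would pass for three masters.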