[ https://issues.apache.org/jira/browse/MESOS-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225339#comment-15225339 ]
Cosmin Lehene commented on MESOS-5114:
--------------------------------------
[~jieyu] This is more of an annoyance than a blocker, but it can be
problematic because it's not obvious what's going on.
This happened as part of an upgrade that removed the default for the quorum
configuration.
It took me more than an hour of deleting state on the host and in ZK before I
realized what was going on.
> empty quorum config causes masters to fail replica recovery and fail
> --------------------------------------------------------------------
>
> Key: MESOS-5114
> URL: https://issues.apache.org/jira/browse/MESOS-5114
> Project: Mesos
> Issue Type: Bug
> Components: master, replicated log
> Affects Versions: 0.28.0
> Environment: CentOS 7.1
> Reporter: Cosmin Lehene
> Fix For: 0.28.1
>
>
> A missing default for the quorum size generated the following master config:
> {code}
> MESOS_WORK_DIR="/var/lib/mesos/master"
> MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos"
> MESOS_QUORUM=
> MESOS_PORT=5050
> MESOS_CLUSTER="mesos"
> MESOS_LOG_DIR="/var/log/mesos"
> MESOS_LOGBUFSECS=1
> MESOS_LOGGING_LEVEL="INFO"
> {code}
> This was causing each elected leader to attempt replica recovery.
> E.g. {{group.cpp:700] Trying to get '/mesos/log_replicas/0000000012' in
> ZooKeeper}}
> And eventually:
> {{master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to
> perform fetch within 1mins}}
> Full log from one of the masters:
> https://gist.github.com/clehene/09a9ddfe49b92a5deb4c1b421f63479e
> All masters and zk nodes were reachable over the network.
> Also, once the quorum was configured, the master recovery protocol finished
> gracefully.
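
Since the failure only shows up as a fetch timeout during recovery, a pre-flight
check on the generated config would make the empty MESOS_QUORUM obvious before
the master starts. The following is a minimal sketch, not part of Mesos: the
helper names parse_env_file and check_quorum are hypothetical, and it assumes
the config is a plain KEY=VALUE file like the one quoted above. The optional
majority check follows the Mesos documentation's guidance that the quorum should
be greater than half the number of masters.

```python
import re


def parse_env_file(text):
    """Parse KEY=VALUE lines (values optionally double-quoted) into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def check_quorum(settings, master_count=None):
    """Return a list of problems with MESOS_QUORUM; an empty list means OK.

    master_count is an assumption supplied by the caller (the number of
    masters in the cluster); when given, the quorum must be a majority.
    """
    value = settings.get("MESOS_QUORUM")
    if value is None:
        return ["MESOS_QUORUM is not set"]
    if value == "":
        return ["MESOS_QUORUM is set but empty"]
    if not re.fullmatch(r"[0-9]+", value) or int(value) < 1:
        return ["MESOS_QUORUM must be a positive integer, got %r" % value]
    quorum = int(value)
    if master_count is not None:
        if quorum > master_count or quorum <= master_count // 2:
            return [
                "MESOS_QUORUM=%d is not a majority of %d masters"
                % (quorum, master_count)
            ]
    return []


if __name__ == "__main__":
    config = 'MESOS_ZK="zk://zk1:2181,zk2:2181,zk3:2181/mesos"\nMESOS_QUORUM=\n'
    for problem in check_quorum(parse_env_file(config)):
        print(problem)
```

Run against the config from this issue, the check reports the empty value right
away instead of leaving it to be discovered through the recovery timeout.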