My understanding is that you need an odd number of monitors to reach quorum. 
This seems to match what you're seeing: with 3, there is a definite leader, but 
with 4, there isn't. Have you tried starting both the 4th and 5th 
simultaneously and letting them both vote?
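
To put numbers on it, here is a rough Python sketch. It assumes the usual rule that a quorum is a strict majority of the monitors listed in the monmap, whether or not they are running. The function names are mine, not anything from Ceph.

```python
def majority(monmap_size: int) -> int:
    """Smallest monitor count that is a strict majority of the monmap."""
    return monmap_size // 2 + 1


def can_form_quorum(monmap_size: int, monitors_up: int) -> bool:
    """True if the running monitors are numerous enough to form a quorum."""
    return monitors_up >= majority(monmap_size)


if __name__ == "__main__":
    # The monmap in the quoted mon_status output lists 5 monitors (ranks 0-4).
    for up in (2, 3, 4, 5):
        print(f"{up} of 5 up: enough for quorum = {can_form_quorum(5, up)}")
```

By that arithmetic 3 of 5 is the minimum and 4 of 5 is also enough on paper, so the even count may not be the whole story. Starting the remaining monitors together is still worth a try.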

--
Joshua M. Boniface
Linux System Ærchitect
Sigmentation fault. Core dumped.

On 25/07/16 10:41 AM, Sergio A. de Carvalho Jr. wrote:
> In the logs, 2 monitors are constantly reporting that they won the 
> leader election:
>
> 60z0m02 (monitor 0):
> 2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] : 
> mon.60z0m02@0 won leader election with quorum 0,2,4
> 2016-07-25 14:31:44.521552 7f8760af7700  1 mon.60z0m02@0(leader).paxos(paxos 
> recovering c 1318755..1319320) collect timeout, calling fresh election
>
> 60zxl02 (monitor 1):
> 2016-07-25 14:31:59.542346 7fefdeaed700  1 
> mon.60zxl02@1(electing).elector(11441) init, last seen epoch 11441
> 2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] : 
> mon.60zxl02@1 won leader election with quorum 1,2,4
> 2016-07-25 14:32:33.440103 7fefdf4ee700  1 mon.60zxl02@1(leader).paxos(paxos 
> recovering c 1318755..1319319) collect timeout, calling fresh election
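
To compare who claims to have won and with which quorum, something like the following Python sketch could be run over the monitor logs. The regular expression is based only on the two "won leader election" lines quoted above; the function name and sample constant are mine, and other Ceph versions may word the message differently.

```python
import re

# Pattern taken from the "won leader election" lines quoted in this thread.
WIN_RE = re.compile(
    r"mon\.([^\s@]+)@(\d+) won leader election with quorum ([\d,]+)"
)


def election_wins(log_text: str) -> list[tuple[str, int, list[int]]]:
    """Return (monitor name, rank, quorum ranks) for each claimed win."""
    # The mail client wrapped each log entry, so collapse whitespace first.
    flat = " ".join(log_text.split())
    wins = []
    for name, rank, quorum in WIN_RE.findall(flat):
        wins.append((name, int(rank), [int(r) for r in quorum.split(",")]))
    return wins


SAMPLE = """\
2016-07-25 14:31:11.644335 7f8760af7700  0 log_channel(cluster) log [INF] :
mon.60z0m02@0 won leader election with quorum 0,2,4
2016-07-25 14:32:04.583929 7fefdf4ee700  0 log_channel(cluster) log [INF] :
mon.60zxl02@1 won leader election with quorum 1,2,4
"""

if __name__ == "__main__":
    for name, rank, quorum in election_wins(SAMPLE):
        print(f"{name} (rank {rank}) claimed quorum {quorum}")
```

In the two lines you quoted, each winner's quorum contains ranks 2 and 4 but not the other claimant (0 is absent from 1,2,4 and 1 is absent from 0,2,4). It might be worth checking whether those two monitors can reach each other.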
>
>
> On Mon, Jul 25, 2016 at 3:27 PM, Sergio A. de Carvalho Jr. 
> <[email protected]> wrote:
>
>     Hi,
>
>     I have a cluster of 5 hosts running Ceph 0.94.6 on CentOS 6.5. On each 
> host, there is 1 monitor and 13 OSDs. We had an issue with the network and 
> for some reason (I still don't know why), the servers were restarted. 
> One host is still down, but the monitors on the 4 remaining servers are 
> failing to enter a quorum.
>
>     I managed to get a quorum of 3 monitors by stopping all Ceph monitors and 
> OSDs across all machines, and bringing up the top 3 ranked monitors in order 
> of rank. After a few minutes, the 60z0m02 monitor (the top ranked one) became 
> the leader:
>
>     {
>         "name": "60z0m02",
>         "rank": 0,
>         "state": "leader",
>         "election_epoch": 11328,
>         "quorum": [
>             0,
>             1,
>             2
>         ],
>         "outside_quorum": [],
>         "extra_probe_peers": [],
>         "sync_provider": [],
>         "monmap": {
>             "epoch": 5,
>             "fsid": "2f51a247-3155-4bcf-9aee-c6f6b2c5e2af",
>             "modified": "2016-04-28 22:26:48.604393",
>             "created": "0.000000",
>             "mons": [
>                 {
>                     "rank": 0,
>                     "name": "60z0m02",
>                     "addr": "10.98.2.166:6789\/0"
>                 },
>                 {
>                     "rank": 1,
>                     "name": "60zxl02",
>                     "addr": "10.98.2.167:6789\/0"
>                 },
>                 {
>                     "rank": 2,
>                     "name": "610wl02",
>                     "addr": "10.98.2.173:6789\/0"
>                 },
>                 {
>                     "rank": 3,
>                     "name": "618yl02",
>                     "addr": "10.98.2.214:6789\/0"
>                 },
>                 {
>                     "rank": 4,
>                     "name": "615yl02",
>                     "addr": "10.98.2.216:6789\/0"
>                 }
>             ]
>         }
>     }
>
>     The other 2 monitors became peons:
>
>     "name": "60zxl02",
>         "rank": 1,
>         "state": "peon",
>         "election_epoch": 11328,
>         "quorum": [
>             0,
>             1,
>             2
>         ],
>
>     "name": "610wl02",
>         "rank": 2,
>         "state": "peon",
>         "election_epoch": 11328,
>         "quorum": [
>             0,
>             1,
>             2
>         ],
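
If it helps when comparing monitors, here is a small Python sketch that takes a full mon_status dump (like the leader's output above) and lists the monmap entries that are missing from the quorum. It relies only on the "quorum" and "monmap" fields visible in that output; the function name and the trimmed sample are mine.

```python
import json


def missing_from_quorum(mon_status_json: str) -> list[str]:
    """Names of monitors listed in the monmap whose rank is not in the quorum."""
    status = json.loads(mon_status_json)
    in_quorum = set(status["quorum"])
    return [
        mon["name"]
        for mon in status["monmap"]["mons"]
        if mon["rank"] not in in_quorum
    ]


# Trimmed copy of the leader's mon_status output quoted above (addresses omitted).
SAMPLE = """
{
    "name": "60z0m02",
    "rank": 0,
    "state": "leader",
    "quorum": [0, 1, 2],
    "monmap": {
        "mons": [
            {"rank": 0, "name": "60z0m02"},
            {"rank": 1, "name": "60zxl02"},
            {"rank": 2, "name": "610wl02"},
            {"rank": 3, "name": "618yl02"},
            {"rank": 4, "name": "615yl02"}
        ]
    }
}
"""

if __name__ == "__main__":
    print("Not in quorum:", missing_from_quorum(SAMPLE))
```

Running the same check against each monitor's own output would show whether they agree on who is in and who is out.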
>
>     I then proceeded to start the fourth monitor, 615yl02 (618yl02 is powered 
> off), but after more than 2 hours and several election rounds, the monitors 
> still haven't reached a quorum. The monitors alternate mostly between the 
> "electing" and "probing" states, and they often seem to be in different 
> election epochs.
>
>     Is this normal?
>
>     Is there anything I can do to help the monitors elect a leader? Should I 
> manually remove the dead host's monitor from the monitor map?
>
>     I have deliberately left all OSD daemons stopped while the election is 
> going on. Is this the best thing to do? Would bringing the OSDs up help or complicate 
> matters even more? Or doesn't it make any difference?
>
>     I don't see anything obviously wrong in the monitor logs. They're mostly 
> filled with messages like the following:
>
>     2016-07-25 14:17:57.806148 7fc1b3f7e700  1 
> mon.610wl02@2(electing).elector(11411) init, last seen epoch 11411
>     2016-07-25 14:17:57.829198 7fc1b7caf700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
>     2016-07-25 14:17:57.829200 7fc1b7caf700  0 log_channel(audit) do_log log 
> to syslog
>     2016-07-25 14:17:57.829254 7fc1b7caf700  0 log_channel(audit) log [DBG] : 
> from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
>
>     Any help would be hugely appreciated.
>
>     Thanks,
>
>     Sergio
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

