Re: [ceph-users] URGENT: add mon failed and ceph monitor refreshes logs crazily
It sounds a bit like the extra load on mon.e from the synchronization is preventing it from joining the quorum? If you stop and restart mon.f, it should pick a different mon to pull from, though. Perhaps see if that makes a different mon drop out? Then at least we'd understand what is going on...

sage

On Fri, 13 Feb 2015, minchen wrote:

The ceph version is 0.80.4. When adding mon.f to {b,c,d,e}, mon.e is out of quorum, and mon.b, mon.c, mon.d are electing in a cycle (restarting a new election after the leader wins). So I think the current 4 monitors can exchange messages with each other successfully. In addition, mon.f is stuck in state synchronizing, and is getting data from mon.e after probing. When I stop mon.f, mon.e goes back into quorum after a while, and the ceph cluster becomes HEALTH_OK. But the mon.b, mon.c, mon.d and mon.e logs all refresh paxos "active" or "updating" messages many times per second, and the paxos commit seq increases quickly, while the same situation does not occur in a cluster running ceph-0.80.7. If you are still confused, maybe I should reproduce this in our cluster and get complete mon logs...

------ Original ------
From: sweil sw...@redhat.com
Date: Fri, Feb 13, 2015 10:28 PM
To: minchen minc...@ubuntukylin.com
Cc: ceph-users ceph-users@lists.ceph.com; joao j...@redhat.com
Subject: Re: [ceph-users] URGENT: add mon failed and ceph monitor refreshes logs crazily

What version is this? It's hard to tell from the logs below, but it looks like there might be a connectivity problem? Is it able to exchange messages with the other monitors?

Perhaps more importantly, though, if you simply stop the new mon.f, can mon.e join? What is in its log?

sage

On Fri, 13 Feb 2015, minchen wrote:

Hi, all developers and users,
When I add a new mon to the current mon cluster, it failed with 2 mons out of quorum.
There are 5 mons in our ceph cluster:

epoch 7
fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
last_changed 2015-02-13 09:11:45.758839
created 0.00
0: 10.117.16.17:6789/0 mon.b
1: 10.118.32.7:6789/0 mon.c
2: 10.119.16.11:6789/0 mon.d
3: 10.122.0.9:6789/0 mon.e
4: 10.122.48.11:6789/0 mon.f

mon.f is newly added to the monitor cluster, but when starting mon.f, it caused both mon.e and mon.f to fall out of quorum:

HEALTH_WARN 2 mons down, quorum 0,1,2 b,c,d
mon.e (rank 3) addr 10.122.0.9:6789/0 is down (out of quorum)
mon.f (rank 4) addr 10.122.48.11:6789/0 is down (out of quorum)

The mon.b, mon.c, mon.d logs refresh crazily as follows:

Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.063628 7f7b64e14700 1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.063629 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090647 7f7b64e14700 1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.090648 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090661 7f7b64e14700 1 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable now=2015-02-13 09:37:34.090662 lease_expire=2015-02-13 09:37:38.205219 has v0 lc 11819234
...
And the mon.f log:

Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.526676 7f3931dfd7c0 0 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f), process ceph-mon, pid 30639
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.607412 7f3931dfd7c0 0 mon.f does not exist in monmap, will attempt to join an existing cluster
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.609838 7f3931dfd7c0 0 starting mon.f rank -1 at 10.122.48.11:6789/0 mon_data /osd/ceph/mon fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.610076 7f3931dfd7c0 1 mon.f@-1(probing) e0 preinit fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636499 7f392a504700 0 -- 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ebfb80 sd=26 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9ce0).accept connect_seq 0 vs existing 0 state wait
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636797 7f392a201700 0 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0800 sd=29 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa940).accept connect_seq 0 vs existing 0 state wait
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636968 7f392a403700 0 -- 10.122.48.11:6789/0 >> 10.118.32.7:6789/0 pipe(0x7f3934ec0080 sd=27 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9e40).accept connect_seq 0 vs existing 0 state wait
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.637037 7f392a302700 0 -- 10.122.48.11:6789/0 >> 10.117.16.17:6789/0 pipe(0x7f3934ebfe00 sd=28 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa260).accept connect_seq 0 vs existing 0 state wait
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.638854 7f392c00a700 0 mon.f@-1(probing) e7 my rank is now 4 (was -1)
Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639365 7f392c00a700 1 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap
...
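For context on the HEALTH_WARN above: a monitor cluster keeps quorum as long as a strict majority of the monmap's members are up. A small sketch of that arithmetic (just the majority rule, not Ceph code) shows why the 5-mon map still reports quorum 0,1,2 b,c,d with mon.e and mon.f down, and why losing one more monitor would stall the cluster:

```python
def quorum_needed(monmap_size: int) -> int:
    # A Paxos quorum is a strict majority of the monitor map.
    return monmap_size // 2 + 1

def has_quorum(monmap_size: int, mons_up: int) -> bool:
    return mons_up >= quorum_needed(monmap_size)

# 5 mons in the map (b, c, d, e, f), mon.e and mon.f down -> 3 up
print(quorum_needed(5))   # -> 3
print(has_quorum(5, 3))   # -> True: b, c, d are exactly the minimum majority
print(has_quorum(5, 2))   # -> False: one more loss and the cluster stalls
```

Note that adding mon.f grew the map from 4 to 5, so the quorum requirement stayed at 3; the cluster remained functional (HEALTH_WARN, not down) on b, c, d alone.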
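The paxos state and commit range can be read straight out of those is_readable lines, which makes it easy to measure how fast the commit seq is climbing between samples. A throwaway parser like the following (an illustrative sketch, not a Ceph tool) pulls out the fields:

```python
import re

# Matches e.g. "mon.b@0(leader).paxos(paxos active c 11818589..11819234)"
PAXOS_RE = re.compile(
    r"mon\.(\w+)@(\d+)\((\w+)\)\.paxos\(paxos (\w+) c (\d+)\.\.(\d+)\)")

def parse_paxos(line: str):
    """Extract mon name, rank, role, paxos state and commit range from a log line."""
    m = PAXOS_RE.search(line)
    if not m:
        return None
    name, rank, role, state, first, last = m.groups()
    return {"mon": name, "rank": int(rank), "role": role, "state": state,
            "first_committed": int(first), "last_committed": int(last)}

line = ("2015-02-13 09:37:34.063628 7f7b64e14700 1 "
        "mon.b@0(leader).paxos(paxos active c 11818589..11819234) "
        "is_readable now=2015-02-13 09:37:34.063629")
info = parse_paxos(line)
print(info["role"], info["state"], info["last_committed"])  # -> leader active 11819234
```

Diffing last_committed between two timestamped samples gives the commit rate; a healthy idle cluster commits a few times a second at most, so a rapidly increasing seq like the one described above points at some proposal loop.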