Re: [ceph-users] UGRENT: add mon failed and ceph monitor refreshlog crazily

2015-02-13 Thread Sage Weil
It sounds a bit like the extra load on mon.e from the synchronization is 
preventing it from joining the quorum?  If you stop and restart mon.f it 
should pick a different mon to pull from, though.  Perhaps see if that 
makes a different mon drop out?  Then at least we'd understand what is 
going on...
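
For what it's worth, a rough sketch of that test (untested here; the init commands 
assume a sysvinit-style setup with the standard "ceph" service script, so adjust 
for your init system):

  # restart the new monitor so it picks a different sync source
  /etc/init.d/ceph stop mon.f
  /etc/init.d/ceph start mon.f

  # then watch which monitors remain in (or drop out of) quorum
  ceph quorum_status
  ceph health detail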

sage

On Fri, 13 Feb 2015, minchen wrote:

 
 ceph version is 0.80.4
 
 When adding mon.f to {b,c,d,e}, mon.e is out of quorum, and
 mon.b, mon.c, mon.d are electing in a cycle (a new election restarts right
 after the leader wins).
 So I think the current 4 monitors can exchange messages with each other
 successfully.
 
 In addition, mon.f is stuck in the synchronizing state, getting data from
 mon.e after probing. 
 
 When I stop mon.f, mon.e goes back into quorum after a while, and the ceph
 cluster becomes HEALTH_OK.
 But all of mon.b, mon.c, mon.d and mon.e keep logging paxos active or
 updating messages many times per second, 
 and the paxos commit seq is increasing rapidly, while the same situation does
 not occur in a ceph-0.80.7 cluster. 
 
 If you are still confused, maybe I should reproduce this in our cluster and
 get complete mon logs ...
 
 -- Original --
 From:  sweil <sw...@redhat.com>
 Date:  Fri, Feb 13, 2015 10:28 PM
 To:  minchen <minc...@ubuntukylin.com>
 Cc:  ceph-users <ceph-users@lists.ceph.com>; joao <j...@redhat.com>
 Subject:  Re: [ceph-users] UGRENT: add mon failed and ceph monitor
 refreshlog crazily
 
 What version is this?
 
 It's hard to tell from the logs below, but it looks like there might be a
 connectivity problem?  Is it able to exchange messages with the other
 monitors?
 
 Perhaps more importantly, though, if you simply stop the new mon.f, can
 mon.e join?  What is in its log?
 
 sage
 
 
 On Fri, 13 Feb 2015, minchen wrote:
 
  Hi, all developers and users,
  when I add a new mon to the current mon cluster, it fails with 2 mons out of
 quorum.
 
 
  there are 5 mons in our ceph cluster:
  epoch 7
  fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
  last_changed 2015-02-13 09:11:45.758839
  created 0.00
  0: 10.117.16.17:6789/0 mon.b
  1: 10.118.32.7:6789/0 mon.c
  2: 10.119.16.11:6789/0 mon.d
  3: 10.122.0.9:6789/0 mon.e
  4: 10.122.48.11:6789/0 mon.f
 
 
  mon.f is newly added to the monitor cluster, but starting mon.f
  caused both mon.e and mon.f to fall out of quorum:
  HEALTH_WARN 2 mons down, quorum 0,1,2 b,c,d
  mon.e (rank 3) addr 10.122.0.9:6789/0 is down (out of quorum)
  mon.f (rank 4) addr 10.122.48.11:6789/0 is down (out of quorum)
 
 
  mon.b, mon.c, and mon.d logs refresh crazily, as follows:
  Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.063628 7f7b64e14700  1
 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable
 now=2015-02-13 09:37:34.063629 lease_expire=2015-02-13 09:37:38.205219 has
 v0 lc 11819234
  Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090647 7f7b64e14700  1
 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable
 now=2015-02-13 09:37:34.090648 lease_expire=2015-02-13 09:37:38.205219 has
 v0 lc 11819234
  Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090661 7f7b64e14700  1
 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable
 now=2015-02-13 09:37:34.090662 lease_expire=2015-02-13 09:37:38.205219 has
 v0 lc 11819234
  ..
 
 
  and mon.f log :
 
 
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.526676 7f3931dfd7c0  0
 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f), process
 ceph-mon, pid 30639
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.607412 7f3931dfd7c0  0
 mon.f does not exist in monmap, will attempt to join an existing cluster
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.609838 7f3931dfd7c0  0
 starting mon.f rank -1 at 10.122.48.11:6789/0 mon_data /osd/ceph/mon fsid
 0dfd2bd5-1896-4712-916b-ec02dcc7b049
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.610076 7f3931dfd7c0  1
 mon.f@-1(probing) e0 preinit fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636499 7f392a504700  0
 -- 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ebfb80 sd=26
 :6789 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9ce0).accept connect_seq 0 vs existing
 0 state wait
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636797 7f392a201700  0
 -- 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0800 sd=29 :6789
 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa940).accept connect_seq 0 vs existing 0
 state wait
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636968 7f392a403700  0
 -- 10.122.48.11:6789/0 >> 10.118.32.7:6789/0 pipe(0x7f3934ec0080 sd=27 :6789
 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9e40).accept connect_seq 0 vs existing 0
 state wait
  Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.637037 7f392a302700  0
 -- 10.122.48.11:6789/0 >> 10.117.16.17:6789/0 pipe

Re: [ceph-users] UGRENT: add mon failed and ceph monitor refreshlog crazily

2015-02-13 Thread minchen
ceph version is 0.80.4


When adding mon.f to {b,c,d,e}, mon.e is out of quorum, and
mon.b, mon.c, mon.d are electing in a cycle (a new election restarts right after 
the leader wins).
So I think the current 4 monitors can exchange messages with each other 
successfully.
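
For reference, the quorum and per-daemon election state can be double-checked with 
something like this (a sketch; the admin socket path below is the default and may 
differ on your hosts):

  # quorum membership and current election epoch, asked of the cluster
  ceph quorum_status

  # one monitor's own view (state: probing / electing / synchronizing / peon / leader),
  # run on that monitor's host; the same command on mon.f's host should show "synchronizing"
  ceph --admin-daemon /var/run/ceph/ceph-mon.b.asok mon_status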


In addition, mon.f is stuck in the synchronizing state, getting data from mon.e 
after probing. 


When I stop mon.f, mon.e goes back into quorum after a while, and the ceph cluster 
becomes HEALTH_OK.
But all of mon.b, mon.c, mon.d and mon.e keep logging paxos active or 
updating messages many times per second, 
and the paxos commit seq is increasing rapidly, while the same situation does not 
occur in a ceph-0.80.7 cluster. 
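
To put a number on that refresh rate, something like this on one of the quorum 
members counts the paxos is_readable lines per second (a rough sketch; it assumes 
the default /var/log/ceph/ceph-mon.b.log file and its standard "date time ..." 
line format rather than the syslog format shown below):

  # count is_readable entries per second in mon.b's log, newest seconds last
  grep is_readable /var/log/ceph/ceph-mon.b.log | awk '{print $2}' | cut -d. -f1 | sort | uniq -c | tail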


If you are still confused, maybe I should reproduce this in our cluster and 
get complete mon logs ...
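
If it would help, debug logging can be turned up on the running mons before 
reproducing, roughly like this (a sketch; the levels are only suggestions, and the 
injectargs line would be repeated for each mon):

  # raise monitor/paxos/messenger debug levels at runtime on one mon
  ceph tell mon.b injectargs '--debug-mon 10 --debug-paxos 10 --debug-ms 1'

  # or set them persistently in ceph.conf under [mon] and restart:
  #   debug mon = 10
  #   debug paxos = 10
  #   debug ms = 1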


-- Original --
From:  sweil <sw...@redhat.com>
Date:  Fri, Feb 13, 2015 10:28 PM
To:  minchen <minc...@ubuntukylin.com>
Cc:  ceph-users <ceph-users@lists.ceph.com>; joao <j...@redhat.com>
Subject:  Re: [ceph-users] UGRENT: add mon failed and ceph monitor refreshlog 
crazily



What version is this?

It's hard to tell from the logs below, but it looks like there might be a 
connectivity problem?  Is it able to exchange messages with the other 
monitors?

Perhaps more importantly, though, if you simply stop the new mon.f, can 
mon.e join?  What is in its log?

sage


On Fri, 13 Feb 2015, minchen wrote:

 Hi, all developers and users,
 when I add a new mon to the current mon cluster, it fails with 2 mons out of quorum.
 
 
 there are 5 mons in our ceph cluster: 
 epoch 7
 fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
 last_changed 2015-02-13 09:11:45.758839
 created 0.00
 0: 10.117.16.17:6789/0 mon.b
 1: 10.118.32.7:6789/0 mon.c
 2: 10.119.16.11:6789/0 mon.d
 3: 10.122.0.9:6789/0 mon.e
 4: 10.122.48.11:6789/0 mon.f
 
 
 mon.f is newly added to the monitor cluster, but starting mon.f 
 caused both mon.e and mon.f to fall out of quorum:
 HEALTH_WARN 2 mons down, quorum 0,1,2 b,c,d
 mon.e (rank 3) addr 10.122.0.9:6789/0 is down (out of quorum)
 mon.f (rank 4) addr 10.122.48.11:6789/0 is down (out of quorum)
 
 
 mon.b, mon.c, and mon.d logs refresh crazily, as follows:
 Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.063628 7f7b64e14700  1 
 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable 
 now=2015-02-13 09:37:34.063629 lease_expire=2015-02-13 09:37:38.205219 has v0 
 lc 11819234
 Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090647 7f7b64e14700  1 
 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable 
 now=2015-02-13 09:37:34.090648 lease_expire=2015-02-13 09:37:38.205219 has v0 
 lc 11819234
 Feb 13 09:37:34 root ceph-mon: 2015-02-13 09:37:34.090661 7f7b64e14700  1 
 mon.b@0(leader).paxos(paxos active c 11818589..11819234) is_readable 
 now=2015-02-13 09:37:34.090662 lease_expire=2015-02-13 09:37:38.205219 has v0 
 lc 11819234
 ..
 
 
 and mon.f log :
 
 
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.526676 7f3931dfd7c0  0 
 ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f), process 
 ceph-mon, pid 30639
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.607412 7f3931dfd7c0  0 
 mon.f does not exist in monmap, will attempt to join an existing cluster
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.609838 7f3931dfd7c0  0 
 starting mon.f rank -1 at 10.122.48.11:6789/0 mon_data /osd/ceph/mon fsid 
 0dfd2bd5-1896-4712-916b-ec02dcc7b049
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.610076 7f3931dfd7c0  1 
 mon.f@-1(probing) e0 preinit fsid 0dfd2bd5-1896-4712-916b-ec02dcc7b049
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636499 7f392a504700  0 -- 
 10.122.48.11:6789/0 >> 10.119.16.11:6789/0 pipe(0x7f3934ebfb80 sd=26 :6789 
 s=0 pgs=0 cs=0 l=0 c=0x7f3934ea9ce0).accept connect_seq 0 vs existing 0 state 
 wait
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636797 7f392a201700  0 -- 
 10.122.48.11:6789/0 >> 10.122.0.9:6789/0 pipe(0x7f3934ec0800 sd=29 :6789 s=0 
 pgs=0 cs=0 l=0 c=0x7f3934eaa940).accept connect_seq 0 vs existing 0 state wait
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.636968 7f392a403700  0 -- 
 10.122.48.11:6789/0 >> 10.118.32.7:6789/0 pipe(0x7f3934ec0080 sd=27 :6789 s=0 
 pgs=0 cs=0 l=0 c=0x7f3934ea9e40).accept connect_seq 0 vs existing 0 state wait
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.637037 7f392a302700  0 -- 
 10.122.48.11:6789/0 >> 10.117.16.17:6789/0 pipe(0x7f3934ebfe00 sd=28 :6789 
 s=0 pgs=0 cs=0 l=0 c=0x7f3934eaa260).accept connect_seq 0 vs existing 0 state 
 wait
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.638854 7f392c00a700  0 
 mon.f@-1(probing) e7  my rank is now 4 (was -1)
 Feb 13 09:16:26 root ceph-mon: 2015-02-13 09:16:26.639365 7f392c00a700  1 
 mon.f@4(synchronizing) e7 sync_obtain_latest_monmap
 Feb 13 09:16:26 root