On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> Greg,
> I think you got me wrong. I am not saying each monitor of a group of 3
> should be able to change the map. Here is the scenario.
>
> 1. Cluster up and running with 3 mons (quorum of 3), all fine.
>
> 2. One node (and mon) is down, quorum of 2, still connecting.
>
> 3. 2 nodes (and 2 mons) are down, should be quorum of 1 now and the client
> should still be able to connect. Isn't it?
No. The monitors can't tell the difference between dead monitors and monitors
they can't reach over the network. So they say "there are three monitors in my
map; therefore it requires two to make any change". That's the case regardless
of whether all of them are running, or only one.

> Cluster with a single monitor is able to form a quorum and should be working
> fine. So, why not in the case of point 3?
> If this is the way Paxos works, should we say that a cluster with, say, 3
> monitors is able to tolerate only one mon failure?

Yes, that is the case. (There is a small sketch of the quorum arithmetic at
the bottom of this mail.)

> Let me know if I am missing a point here.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Gregory Farnum [mailto:g...@gregs42.com]
> Sent: Thursday, March 26, 2015 3:41 PM
> To: Somnath Roy
> Cc: Lee Revell; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down
>
> On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>> Got most of it, thanks!
>> But I still don't get why, when the second node is down and a single
>> monitor is left in the cluster, the client is not able to connect.
>> 1 monitor can form a quorum and should be sufficient for a cluster to run.
>
> The whole point of the monitor cluster is to ensure a globally consistent
> view of the cluster state that will never be reversed by a different group
> of up nodes. If one monitor (out of three) could make changes to the maps
> by itself, then there's nothing to prevent all three monitors from staying
> up but getting a net split, and then each issuing different versions of the
> osdmaps to whichever clients or OSDs happen to be connected to them.
>
> If you want to get down into the math proofs and things then the Paxos
> papers do all the proofs. Or you can look at the CAP theorem about the
> tradeoff between consistency and availability. The monitors are a Paxos
> cluster and Ceph is a 100% consistent system.
> -Greg
>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:g...@gregs42.com]
>> Sent: Thursday, March 26, 2015 3:29 PM
>> To: Somnath Roy
>> Cc: Lee Revell; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down
>>
>> On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>>> Greg,
>>> A couple of dumb questions, maybe.
>>>
>>> 1. If you see, the clients are connecting fine with two monitors in the
>>> cluster. 2 monitors can never form a quorum, but 1 can, so why is the
>>> client not able to connect with 1 monitor (which is, I guess, what happens
>>> after making 2 nodes down)?
>>
>> A quorum is a strict majority of the total membership. 2 monitors can form
>> a quorum just fine if the total membership is either 2 or 3.
>> (As long as those two agree on every action, it cannot be lost.)
>>
>> We don't *recommend* configuring systems with an even number of monitors,
>> because it increases the number of total possible failures without
>> increasing the number of failures that can be tolerated. (3 monitors
>> require 2 in quorum, and 4 do too. Same for 5 and 6, 7 and 8, etc.)
>>
>>> 2. Also, my understanding is that while IO is going on, *no* monitor
>>> interaction will be on that path, so why will the client IO be stopped
>>> because the monitor quorum is not there? If min_size = 1 is properly set,
>>> it should be able to serve IO as long as 1 OSD (node) is up, isn't it?
>>
>> Well, the remaining OSD won't be able to process IO because it's lost its
>> peers, and it can't reach any monitors to do updates or get new maps.
>> (Monitors which are not in quorum will not allow clients to connect.)
>> The clients will eventually stop issuing IO if they know they can't reach
>> a monitor, although I don't remember exactly how that's triggered.
>>
>> In this particular case, though, the client probably just tried to do an
>> op against the dead OSD, realized it couldn't, and tried to fetch a map
>> from the monitors. When that failed it went into search mode, which is
>> what the logs are showing you.
>> -Greg
>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Gregory Farnum
>>> Sent: Thursday, March 26, 2015 2:40 PM
>>> To: Lee Revell
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down
>>>
>>> On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell <rlrev...@gmail.com> wrote:
>>>> On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum <g...@gregs42.com> wrote:
>>>>>
>>>>> Has the OSD actually been detected as down yet?
>>>>>
>>>> I believe it has, however I can't directly check because "ceph health"
>>>> starts to hang when I down the second node.
>>>
>>> Oh. You need to keep a quorum of your monitors running (just the monitor
>>> processes, not of everything in the system) or nothing at all is going to
>>> work. That's how we prevent split-brain issues.
>>>
>>>>> You'll also need to set that min size on your existing pools ("ceph osd
>>>>> pool <pool> set min_size 1" or similar) to change their behavior; the
>>>>> config option only takes effect for newly-created pools. (Thus the
>>>>> "default".)
>>>>
>>>> I've done this, however the behavior is the same:
>>>>
>>>> $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done
>>>> set pool 0 min_size to 1
>>>> set pool 1 min_size to 1
>>>> set pool 2 min_size to 1
>>>> set pool 3 min_size to 1
>>>> set pool 4 min_size to 1
>>>> set pool 5 min_size to 1
>>>> set pool 6 min_size to 1
>>>> set pool 7 min_size to 1
>>>>
>>>> $ ceph -w
>>>>     cluster db460aa2-5129-4aaa-8b2e-43eac727124e
>>>>      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2
>>>>      monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0},
>>>>             election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2
>>>>      mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active}
>>>>      osdmap e362: 3 osds: 2 up, 2 in
>>>>       pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects
>>>>             25329 MB used, 12649 MB / 40059 MB avail
>>>>                  840 active+clean
>>>>
>>>> 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail
>>>> 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s
>>>> 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s
>>>> 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s
>>>>
>>>> <this is where I kill the second OSD>
>>>>
>>>> 2015-03-26 17:26:26.778461 7f4ebeffd700  0 monclient: hunting for new mon
>>>> 2015-03-26 17:26:30.701099 7f4ec45f5700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault
>>>> 2015-03-26 17:26:42.701154 7f4ec44f4700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault
>>>>
>>>> And all writes block until I bring back an OSD.
>>>>
>>>> Lee
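
P.S. Since the quorum arithmetic keeps coming up, here is a minimal Python
sketch (my own illustration, not Ceph code) of why a monmap of 3 only
tolerates one monitor failure: a quorum is a strict majority of the monitors
in the map, whether or not the others are actually alive.

def quorum_size(monmap_size):
    # Smallest strict majority of the total membership.
    return monmap_size // 2 + 1

def failures_tolerated(monmap_size):
    # Monitors that can be lost while a quorum can still form.
    return monmap_size - quorum_size(monmap_size)

for n in range(1, 8):
    print("%d mons -> quorum of %d, tolerates %d failure(s)"
          % (n, quorum_size(n), failures_tolerated(n)))

# 1 mons -> quorum of 1, tolerates 0 failure(s)
# 2 mons -> quorum of 2, tolerates 0 failure(s)
# 3 mons -> quorum of 2, tolerates 1 failure(s)
# 4 mons -> quorum of 3, tolerates 1 failure(s)
# 5 mons -> quorum of 3, tolerates 2 failure(s)
# 6 mons -> quorum of 4, tolerates 2 failure(s)
# 7 mons -> quorum of 4, tolerates 3 failure(s)

Which is also why an even number of monitors adds more ways to fail without
adding any extra tolerance.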
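
And a hedged python-rados sketch (the conf path and keyring are assumptions;
adjust for your cluster) of the "nothing works without a monitor quorum"
behaviour: the first thing any client does is fetch maps from a quorate
monitor, which is why "ceph health" and this script both hang once two of the
three mons are gone.

import json
import rados

# Assumes the default conf path and an admin keyring on this host.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')

# This call blocks (like "ceph health") if no monitor quorum is reachable.
cluster.connect()

# Ask the monitors who is currently in quorum.
ret, outbuf, errs = cluster.mon_command(
    json.dumps({'prefix': 'quorum_status'}), b'')
print(json.loads(outbuf.decode('utf-8'))['quorum_names'])

cluster.shutdown()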