On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy <somnath....@sandisk.com> wrote:
> Greg,
> I think you got me wrong. I am not saying each monitor of a group of 3 should 
> be able to change the map. Here is the scenario.
>
> 1. Cluster up and running with 3 mons (quorum of 3), all fine.
>
> 2. One node (and mon) is down, quorum of 2, clients still connecting.
>
> 3. 2 nodes (and 2 mons) are down; it should be a quorum of 1 now and the client
> should still be able to connect, shouldn't it?

No. The monitors can't tell the difference between dead monitors, and
monitors they can't reach over the network. So they say "there are
three monitors in my map; therefore it requires two to make any
change". That's the case regardless of whether all of them are
running, or only one.
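
A quick way to see this from a single surviving node: the monitor's admin socket
answers even when there is no quorum, so a command along these lines should show a
monmap that still lists 3 members while the local daemon reports a state like
"probing" or "electing" instead of sitting in quorum. Rough sketch only; it assumes
the default admin socket path and the mon names from the "ceph -w" output further
down the thread:

$ ceph daemon mon.ceph-node-1 mon_status
# or, equivalently, point at the socket directly:
$ ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status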

>
> A cluster with a single monitor is able to form a quorum and should be working
> fine. So why not in the case of point 3?
> If this is the way Paxos works, should we say that a cluster with, say, 3
> monitors can tolerate only one mon failure?

Yes, that is the case.

>
> Let me know if I am missing a point here.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Gregory Farnum [mailto:g...@gregs42.com]
> Sent: Thursday, March 26, 2015 3:41 PM
> To: Somnath Roy
> Cc: Lee Revell; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down
>
> On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>> Got most of it, thanks!
>> But I'm still not able to understand why, when the second node is down and only a
>> single monitor is left in the cluster, the client is not able to connect.
>> 1 monitor can form a quorum and should be sufficient for a cluster to run.
>
> The whole point of the monitor cluster is to ensure a globally consistent 
> view of the cluster state that will never be reversed by a different group of 
> up nodes. If one monitor (out of three) could make changes to the maps by 
> itself, then there's nothing to prevent all three monitors from staying up 
> but getting a net split, and then each issuing different versions of the 
> osdmaps to whichever clients or OSDs happen to be connected to them.
>
> If you want to get down into the math then the Paxos papers have all the proofs.
> Or you can look at the CAP theorem about the tradeoff between consistency and
> availability. The monitors are a Paxos cluster, and Ceph is a 100% consistent system.
> -Greg
>
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Gregory Farnum [mailto:g...@gregs42.com]
>> Sent: Thursday, March 26, 2015 3:29 PM
>> To: Somnath Roy
>> Cc: Lee Revell; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs
>> down
>>
>> On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy <somnath....@sandisk.com> wrote:
>>> Greg,
>>> A couple of dumb questions, maybe.
>>>
>>> 1. If you see, the clients are connecting fine with two monitors in the
>>> cluster. 2 monitors can never form a quorum, but 1 can, so why, with 1
>>> monitor (which I guess is what happens after taking 2 nodes down), is it not
>>> able to connect?
>>
>> A quorum is a strict majority of the total membership. 2 monitors can form a
>> quorum just fine if the total membership is either 2 or 3.
>> (As long as those two agree on every action, it cannot be lost.)
>>
>> We don't *recommend* configuring systems with an even number of
>> monitors, because it increases the number of total possible failures
>> without increasing the number of failures that can be tolerated. (3
>> monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8,
>> etc etc.)
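
The arithmetic behind that, if it helps: a quorum is floor(N/2) + 1 monitors, so the
number of failures you can tolerate is N minus that. A plain-shell one-liner to
tabulate it (nothing Ceph-specific here):

$ for n in 1 2 3 4 5 6 7; do m=$((n/2+1)); echo "mons=$n quorum_needs=$m failures_tolerated=$((n-m))"; done
mons=1 quorum_needs=1 failures_tolerated=0
mons=2 quorum_needs=2 failures_tolerated=0
mons=3 quorum_needs=2 failures_tolerated=1
mons=4 quorum_needs=3 failures_tolerated=1
mons=5 quorum_needs=3 failures_tolerated=2
mons=6 quorum_needs=4 failures_tolerated=2
mons=7 quorum_needs=4 failures_tolerated=3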
>>
>>>
>>> 2. Also, my understanding is that while IO is going on there is *no* monitor
>>> interaction on that path, so why would client IO be stopped because the
>>> monitor quorum is not there? If min_size = 1 is properly set, it should be
>>> able to serve IO as long as 1 OSD (node) is up, shouldn't it?
>>
>> Well, the remaining OSD won't be able to process IO because it's lost
>> its peers, and it can't reach any monitors to do updates or get new
>> maps. (Monitors which are not in quorum will not allow clients to
>> connect.)
>> The clients will eventually stop serving IO if they know they can't reach a 
>> monitor, although I don't remember exactly how that's triggered.
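
If you want to poke at the surviving OSD while the monitors are unreachable, its
local admin socket still responds. A sketch ("osd.0" is just a stand-in for whichever
OSD id is still up on that host, and exact admin socket commands vary a bit by
release):

$ ceph daemon osd.0 status                # local state, current osdmap epoch
$ ceph daemon osd.0 dump_ops_in_flight    # ops stuck waiting rather than completing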
>>
>> In this particular case, though, the client probably just tried to do an op 
>> against the dead osd, realized it couldn't, and tried to fetch a map from 
>> the monitors. When that failed it went into search mode, which is what the 
>> logs are showing you.
>> -Greg
>>
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>> Of Gregory Farnum
>>> Sent: Thursday, March 26, 2015 2:40 PM
>>> To: Lee Revell
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs
>>> down
>>>
>>> On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell <rlrev...@gmail.com> wrote:
>>>> On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum <g...@gregs42.com> wrote:
>>>>>
>>>>> Has the OSD actually been detected as down yet?
>>>>>
>>>>
>>>> I believe it has, however I can't directly check because "ceph health"
>>>> starts to hang when I down the second node.
>>>
>>> Oh. You need to keep a quorum of your monitors running (just the monitor 
>>> processes, not of everything in the system) or nothing at all is going to 
>>> work. That's how we prevent split brain issues.
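
As an aside, the hang is just the ceph tool waiting indefinitely for a monitor
session. If you would rather have it fail fast, the CLI takes a connect timeout
(assuming your version supports the option), and the per-daemon admin sockets
mentioned earlier keep working locally even with no quorum:

$ ceph --connect-timeout 10 health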
>>>
>>>>
>>>>>
>>>>> You'll also need to set that min_size on your existing pools ("ceph
>>>>> osd pool set <pool> min_size 1" or similar) to change their
>>>>> behavior; the config option only takes effect for newly-created
>>>>> pools. (Thus the "default".)
>>>>
>>>>
>>>> I've done this, however the behavior is the same:
>>>>
>>>> $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done
>>>> set pool 0 min_size to 1
>>>> set pool 1 min_size to 1
>>>> set pool 2 min_size to 1
>>>> set pool 3 min_size to 1
>>>> set pool 4 min_size to 1
>>>> set pool 5 min_size to 1
>>>> set pool 6 min_size to 1
>>>> set pool 7 min_size to 1
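
Minor aside on that loop: "rados lspools" prints one pool name per line, which avoids
the sed juggling. A sketch, assuming no pool names contain whitespace:

$ for p in $(rados lspools); do ceph osd pool set "$p" min_size 1; done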
>>>>
>>>> $ ceph -w
>>>>     cluster db460aa2-5129-4aaa-8b2e-43eac727124e
>>>>      health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2
>>>>      monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2
>>>>      mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active}
>>>>      osdmap e362: 3 osds: 2 up, 2 in
>>>>       pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects
>>>>             25329 MB used, 12649 MB / 40059 MB avail
>>>>                  840 active+clean
>>>>
>>>> 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail
>>>> 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s
>>>> 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s
>>>> 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s
>>>>
>>>> <this is where i kill the second OSD>
>>>>
>>>> 2015-03-26 17:26:26.778461 7f4ebeffd700  0 monclient: hunting for new mon
>>>> 2015-03-26 17:26:30.701099 7f4ec45f5700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault
>>>> 2015-03-26 17:26:42.701154 7f4ec44f4700  0 -- 192.168.122.111:0/1007741 >> 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault
>>>>
>>>> And all writes block until I bring back an OSD.
>>>>
>>>> Lee
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
