On Mar 31, 2011, at 2:30 PM, Christoph Bartoschek wrote:
> Am 29.03.2011 15:31, schrieb Dejan Muhamedagic:
>> On Tue, Mar 29, 2011 at 08:13:49AM +0200, Christoph Bartoschek wrote:
>>> Am 29.03.2011 02:35, schrieb Vadym Chepkov:
>>>>
>>>> On Mar 28, 2011, at 10:55 AM, Christoph Bartoschek wrote:
>>>>
>>>>> Am 28.03.2011 16:30, schrieb Dejan Muhamedagic:
>>>>>> Hi,
>>>>>>
>>>>>> On Mon, Mar 21, 2011 at 11:33:49PM +0100, Christoph Bartoschek wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am testing a NFS failover setup. During the tests I created a
>>>>>>> split-brain situation and now node A thinks it is primary and uptodate
>>>>>>> while node B thinks that it is Outdated.
>>>>>>>
>>>>>>> crm_mon however does not indicate any error to me. Why is this the case?
>>>>>>> I expect to see anything that shows me the degraded status. How can this
>>>>>>> be fixed?
>>>>>>
>>>>>> The cluster relies on the RA (in this case drbd) to report any
>>>>>> problems. Do you have a monitor operation defined for that
>>>>>> resource?
>>>>>
>>>>> I have the resource defined as:
>>>>>
>>>>> primitive p_drbd ocf:linbit:drbd \
>>>>>   params drbd_resource="home-data" \
>>>>>   op monitor interval="15" role="Master" \
>>>>>   op monitor interval="30" role="Slave"
>>>>>
>>>>> Is this a correct monitor operation?
>>
>> Yes, though you should also add timeout specs.
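>> For example, something like this (the timeout values here are only
>> illustrative; the RA's meta-data advertises the recommended minimums):
>>
>>   op monitor interval="15" role="Master" timeout="20" \
>>   op monitor interval="30" role="Slave" timeout="20"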
>>
>>>> Just out of curiosity, you do have ms resource defined?
>>>>
>>>> ms ms_p_drbd p_drbd \
>>>>   meta master-max="1" master-node-max="1" clone-max="2" \
>>>>   clone-node-max="1" notify="true"
>>>>
>>>> Because if you do, and the cluster is still not aware of the
>>>> split-brain, then the drbd RA has a serious flaw.
>>>>
>>>
>>> I'm sorry. Yes, the ms resource is also defined.
>>
>> Well, I'm really confused. You basically say that the drbd disk
>> gets into a degraded mode (i.e. it detects split brain), but the
>> cluster (pacemaker) never learns about that. Perhaps you should
>> open a bugzilla for this and supply hb_report. Though it's
>> really hard to believe. It's like basic functionality failing.
>
>
> What would you expect to see?
>
> Currently I see the following in crm_mon:
>
> Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
> Masters: [ ries ]
> Slaves: [ laplace ]
>
>
> At the same time "cat /proc/drbd" on ries says:
>
> ries:~ # cat /proc/drbd
> version: 8.3.9 (api:88/proto:86-95)
> srcversion: A67EB2D25C5AFBFF3D8B788
> 0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
> ns:0 nr:0 dw:4 dr:1761 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
>
>
> And on node laplace it says:
>
> laplace:~ # cat /proc/drbd
> version: 8.3.9 (api:88/proto:86-95)
> srcversion: A67EB2D25C5AFBFF3D8B788
> 0: cs:StandAlone ro:Secondary/Unknown ds:Outdated/DUnknown r-----
> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:4
>
>
Yes, and according to the RA script everything is perfect:
drbd_status() {
    local rc
    rc=$OCF_NOT_RUNNING
    if ! is_drbd_enabled || ! [ -b "$DRBD_DEVICE" ]; then
        return $rc
    fi

    # ok, module is loaded, block device node exists.
    # lets see its status
    drbd_set_status_variables
    case "${DRBD_ROLE_LOCAL}" in
    Primary)
        rc=$OCF_RUNNING_MASTER
        ;;
    Secondary)
        rc=$OCF_SUCCESS
        ;;
    Unconfigured)
        rc=$OCF_NOT_RUNNING
        ;;
    *)
        ocf_log err "Unexpected role ${DRBD_ROLE_LOCAL}"
        rc=$OCF_ERR_GENERIC
    esac
    return $rc
}
Staggering.
The drbd_set_status_variables subroutine does set DRBD_CSTATE, but
drbd_status never looks at it: only the local role is examined.
I think the RA needs to be modified to something like this:

    Secondary)
        if [[ $DRBD_CSTATE == Connected ]]; then
            rc=$OCF_SUCCESS
        else
            rc=$OCF_NOT_RUNNING
        fi
        ;;
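Rewritten as a tiny standalone function, the proposed behaviour is easy to
check (this is only a sketch: drbd_monitor_rc is a hypothetical helper, the
OCF_* values are the standard OCF exit codes, and a real fix would probably
also have to accept the Sync*/Verify* connection states, not just Connected):

```shell
# Standard OCF exit codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8

# Map the local DRBD role plus connection state to a monitor return
# code.  A disconnected Secondary is reported as not running, so that
# pacemaker actually notices the degraded / split-brain state.
drbd_monitor_rc() {
    role=$1 cstate=$2
    case "$role" in
    Primary)
        echo $OCF_RUNNING_MASTER
        ;;
    Secondary)
        if [ "$cstate" = "Connected" ]; then
            echo $OCF_SUCCESS
        else
            echo $OCF_NOT_RUNNING
        fi
        ;;
    Unconfigured)
        echo $OCF_NOT_RUNNING
        ;;
    *)
        echo $OCF_ERR_GENERIC
        ;;
    esac
}
```

With the /proc/drbd output above, ries (Primary/StandAlone) would still
return OCF_RUNNING_MASTER, but laplace (Secondary/StandAlone) would now
return OCF_NOT_RUNNING instead of OCF_SUCCESS.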
Vadym
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems