return codes
[root@lpissan1001 ~]# service heartbeat stop
Stopping High-Availability services:
[ OK ]
[root@lpissan1001 ~]# drbdadm up Storage1
/dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal
--set-defaults --create-device --on-io-error=pass_on' terminated with exit
code 10
[root@lpissan1001 ~]# drbdadm down Storage1
[root@lpissan1001 ~]# drbdadm up Storage1
[root@lpissan1001 ~]# drbdadm up Storage1
/dev/drbd0: Failure: (124) Device is attached to a disk (use detach first)
Command 'drbdsetup /dev/drbd0 disk /dev/sdb /dev/sdb internal
--set-defaults --create-device --on-io-error=pass_on' terminated with exit
code 10
[root@lpissan1001 ~]# echo $?
1
[root@lpissan1001 ~]# drbdadm down Storage1
[root@lpissan1001 ~]# echo $?
0
[root@lpissan1001 ~]# cat /proc/drbd
version: 8.3.0 (api:88/proto:86-89)
GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by
[email protected], 2009-02-12 13:13:30
0: cs:Unconfigured
[root@lpissan1001 ~]#
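
(If useful, the same sequence can be rerun with each exit code captured
explicitly; a rough sketch, assuming the resource is still Storage1:)

  for cmd in "drbdadm up Storage1" "drbdadm up Storage1" "drbdadm down Storage1"; do
      $cmd
      echo "'$cmd' exited with $?"
  done
  cat /proc/drbd
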
Jason
On Feb 12, 2009 12:24pm, Dominik Klein <[email protected]> wrote:
Did you look into the return codes and, if warranted, report this to Linbit?
That would be a big issue.
Regards
Dominik
[email protected] wrote:
> (Big Cheers and celebrations from this end!!!)
>
> Finally figured out what the problem was: it seems that the kernel
> oopses were being caused by the 8.3 version of DRBD. Once downgraded to
> 8.2.7, everything started to work as it should. Primary/Secondary
> automatic failover is in place, and resources are now following the
> DRBD master!
>
> Thanks a mill for all the help.
>
> Jason
>
> On Feb 12, 2009 8:48am, Jason Fitzpatrick <[email protected]>
> wrote:
>> Hi Dominik
>>
>> Thanks again for the feedback.
>>
>> I had noticed some kernel oopses since the last kernel update, and they
>> seem to be pointing to DRBD. I will downgrade the kernel again and see
>> if this improves things.
>>
>> Re STONITH: I uninstalled it as part of the move from heartbeat v2.1 to
>> 2.9, but must have missed this bit.
>>
>> Userland and kernel module both report the same version.
>>
>> I am on my way into the office now and will apply the changes once
>> there.
>>
>> Thanks again
>>
>> Jason
>>
>> 2009/2/12 Dominik Klein <[email protected]>
>>
>> Right, this one looks better.
>>
>> I'll refer to the nodes as 1001 and 1002.
>>
>> 1002 is your DC.
>>
>> You have stonith enabled, but no stonith devices. Disable stonith, or
>> get and configure a stonith device (_please_ don't use ssh).
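>>
>> A minimal sketch of disabling it from the shell (the exact
>> crm_attribute flags may differ by version, so treat this as an
>> assumption and check crm_attribute --help first):
>>
>>   # turn stonith off cluster-wide until a real stonith device is configured
>>   crm_attribute --type crm_config --attr-name stonith-enabled --attr-value false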
>>
>> 1002 ha-log lines 926:939: node 1002 wants to shoot 1001, but cannot
>> (l 978). It retries in l 1018 and fails again in l 1035.
>>
>> Then the cluster tries to start drbd on 1001 in l 1079, followed by a
>> bunch of kernel messages I don't understand (pretty sure _this_ is the
>> first problem you should address!), ending up in the drbd RA not being
>> able to see the Secondary state (l 1449) and considering the start
>> failed.
>>
>> The RA code for this is:
>>
>>   if do_drbdadm up $RESOURCE ; then
>>       drbd_get_status
>>       if [ "$DRBD_STATE_LOCAL" != "Secondary" ]; then
>>           ocf_log err "$RESOURCE start: not in Secondary mode after start."
>>           return $OCF_ERR_GENERIC
>>       fi
>>       ocf_log debug "$RESOURCE start: succeeded."
>>       return $OCF_SUCCESS
>>   else
>>       ocf_log err "$RESOURCE: Failed to start up."
>>       return $OCF_ERR_GENERIC
>>   fi
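>>
>> (drbd_get_status reads the local role, e.g. from /proc/drbd; a rough
>> stand-in for device minor 0, not the RA's actual code:)
>>
>>   # 8.0/8.2 print the roles as "st:Local/Peer", 8.3 prints "ro:Local/Peer"
>>   DRBD_STATE_LOCAL=$(sed -n 's/^ *0: .*\(st\|ro\):\([A-Za-z]*\)\/.*/\2/p' /proc/drbd)
>>   echo "$DRBD_STATE_LOCAL"   # e.g. "Secondary"; empty when Unconfigured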
>>
>> The cluster then successfully stops drbd again (l 1508-1511) and tries
>> to start the other clone instance (l 1523).
>>
>> The log says:
>>
>>   RA output: (Storage1:1:start:stdout) /dev/drbd0: Failure: (124) Device
>>   is attached to a disk (use detach first) Command 'drbdsetup /dev/drbd0
>>   disk /dev/sdb /dev/sdb internal --set-defaults --create-device
>>   --on-io-error=pass_on' terminated with exit code 10
>>
>>   Feb 11 15:39:05 lpissan1002 drbd[3473]: ERROR: Storage1 start: not in
>>   Secondary mode after start.
>>
>> So this is interesting: although "stop" (basically drbdadm down)
>> succeeded, the drbd device is still attached!
>>
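>> (If you hit that state by hand, something like the following might
>> clear it; a sketch only, untested here:)
>>
>>   drbdsetup /dev/drbd0 detach   # release the lower-level disk, as the error message suggests
>>   drbdadm up Storage1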
>>
>> Please try:
>>
>>   # stop the cluster first, then:
>>   drbdadm up $resource
>>   drbdadm up $resource   # again
>>   echo $?
>>   drbdadm down $resource
>>   echo $?
>>   cat /proc/drbd
>>
>> Btw: Does your userland match your kernel module version?
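>>
>> A quick way to compare the two might be (drbdadm -V prints the userland
>> version variables, if memory serves; worth double-checking):
>>
>>   head -n1 /proc/drbd   # version of the loaded kernel module
>>   drbdadm -V            # version of the userland tools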
>>
>> To bring this to an end: the start of the second clone instance also
>> failed, so both instances are unrunnable on the node, and no further
>> start is tried on 1002.
>>
>> Interestingly, then (I could not see any attempt before this), the
>> cluster wants to start drbd on node 1001, but it also fails and also
>> gives those kernel messages. In l 2001, each instance has a failed
>> start on each node.
>>
>> So: find out about those kernel messages. I can't help much on that,
>> unfortunately, but there were some threads about things like that on
>> drbd-user recently. Maybe you can find answers to that problem there.
>>
>> And also: please verify the return codes of drbdadm in your case. Maybe
>> that's a drbd tools bug? (I can't say for sure; for me, "up" on an
>> already-up resource gives 1, which is ok.)
>>
>> Regards
>> Dominik
>>
>> Jason Fitzpatrick wrote:
>>
>> > it seems that I had the incorrect version of openais installed (from
>> > the fedora repo vs the HA one)
>> >
>> > I have corrected it, and hb_report ran correctly using the following:
>> >
>> >   hb_report -u root -f 3pm /tmp/report
>> >
>> > Please see attached.
>> >
>> > Thanks again
>> >
>> > Jason

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems