My drbd setup is working. I can manually set each node to be primary for
each resource (while the other is secondary, of course). When starting
heartbeat, I make sure every drbd device is either Unconfigured/down or
secondary.
Just don't start drbd at boot. If it's running anyway, heartbeat should
probe and find out that the resources are running.
Okay. As you will see, this leads into other problems (which I solved),
but it does not change my main problem.
So here's what I do:
"1:" means its done on machine 1
"2:" means its done on machine 2
#: means its done on both machines
<...> is a comment by me
#: reboot
<wait> :)
#: ls /proc/drbd
ls: cannot access /proc/drbd: No such file or directory
<translated from the German original; i.e. the module is not loaded>
<make sure we start off clean>
#: rm /var/lib/heartbeat/crm/*
#: /etc/init.d/heartbeat start
<wait again>
<crm_mon shows 2 online nodes, 0 resources>
1: cibadmin -U -x cib.xml (all target roles = stopped, no instance
attributes for any node)
<crm_mon shows 2 online nodes, DC=acd-xen03, *4* resources>
1: crm_resource -r ms-r0 -v 'started' -p target_role
1: crm_resource -r fs0 -v 'started' -p target_role
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2007-06-11 14:48:25
<this means: no resources configured>
<crm_mon shows r0 "started" for both nodes -> not good>
1: drbdadm state r0
Unknown/TOO_LARGE
<OCF script needs to be changed to recognize this (maybe new drbd8)
state after just the module being loaded>
<done>
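The change I made to the agent boils down to a small state classifier. A sketch of the idea (the function name and the coarse categories are illustrative, not the agent's actual code; only the "Unknown/TOO_LARGE" handling is the new part):

```shell
# Classify the output of `drbdadm state <res>` for the monitor action.
# drbd8 reports "Unknown/TOO_LARGE" right after the module is loaded,
# before the resource is configured, so the agent must treat that
# (like "Unconfigured") as "not running" instead of as an error.
drbd_state_class() {
  case "$1" in
    Primary/*)              echo running-master ;;
    Secondary/*)            echo running-slave ;;
    Unconfigured*|Unknown*) echo not-running ;;
    *)                      echo unexpected ;;
  esac
}
```

In the agent this would map to the monitor exit code, with the not-running case returning OCF_NOT_RUNNING rather than a failure.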
So, apart from changing the script and copying it to both machines, I
started over: from the reboot up to setting target_role=started for fs0.
<now crm_mon shows r0:0 on acd-xen03 as master>
<fs0 is mounted on acd-xen03>
<2 online nodes, *4* resources>
Now comes the strange thing:
1: crm_resource -r ms-r1 -v 'started' -p target_role
1: crm_resource -r fs1 -v 'started' -p target_role
<crm_mon shows 2 online nodes, DC still acd-xen03, but *5* resources (+1)>
<one would expect to see the same result as with r0, but:>
<crm_mon shows r1 started on both nodes, no master>
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2007-06-11 14:48:25
...
1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
<confirmed, no master>
<fs1 is not mounted>
<the logs suggest running crm_verify; I stripped the datetime values for readability>
1: crm_verify -LVVVV
info: log_data_element: create_fake_resource: Orphan resource
<lrm_resource id="r1:1" type="drbd_master_slave" class="ocf" provider="dk">
info: log_data_element: create_fake_resource: Orphan resource
<lrm_rsc_op id="r1:1_monitor_0" operation="monitor"
crm-debug-origin="do_update_resource"
transition_key="7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259"
transition_magic="0:7;7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259"
call_id="7" crm_feature_set="1.0.7" rc_code="7" op_status="0"
interval="0" op_digest="58850437bf287086d1b41caade76bbf1"/>
info: log_data_element: create_fake_resource: Orphan resource
</lrm_resource>
info: unpack_find_resource: Making sure orphan r1:1/r1:2 is stopped on
acd-xen01
info: unpack_find_resource: Internally renamed r1:1 on acd-xen01 to r1:2
debug: unpack_rsc_order: r0_before_fs0: ms-r0.promote after fs0.start
(symmetrical)
debug: unpack_rsc_order: r1_before_fs1: ms-r1.promote after fs1.start
(symmetrical)
debug: cib_native_signoff: Signing out of the CIB Service
<r1:2 looks suspicious - no idea where this comes from>
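If that orphan entry is just stale LRM state left over from an earlier run, cleaning it up should stop the policy engine from inventing r1:2. A sketch using the heartbeat-2.x crm_resource flags (resource and node names taken from the log above):

```shell
# Clear the stale r1:1 entry from acd-xen01's status section,
# then ask the cluster to re-probe all resources.
crm_resource -C -r r1:1 -H acd-xen01
crm_resource -P
```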
I get one drbd+fs pair running (the one using r1 in my config). But when
I try to add the other one (the one using r0 in my config), it does not
promote the master and therefore does not mount the fs. The OCF script
hangs and times out at "crm_master -v 75", and as you can see in the
nodes section of the CIB, only the master value for r1 made it into the CIB.
The process "hangs"? How so?
The notify action times out (20s).
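As a workaround for the symptom (not the cause), the notify timeout on the master/slave resource could be raised above the 20s default. A sketch against the heartbeat-2.x CIB format; the op id here is chosen for illustration:

```
<operations>
  <!-- sketch: give the notify action more than the default 20s -->
  <op id="ms-r1-op-notify" name="notify" timeout="60s"/>
</operations>
```

That only buys time for crm_master, of course; if it blocks indefinitely, the timeout just fires later.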
Here a log extract:
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7412]: DEBUG: r1
notify: post for start - counts: active 0 - starting 2 - stopping 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7413]: DEBUG: DK
drbd_start_phase_2 with param "no"
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7415]: DEBUG: r1:
Calling /sbin/drbdadm -c /etc/drbd.conf state r1
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7419]: DEBUG: r1:
Exit code 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7420]: DEBUG: r1:
Command output: Secondary/Secondary
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7428]: DEBUG: r1:
Calling /sbin/drbdadm -c /etc/drbd.conf cstate r1
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7432]: DEBUG: r1:
Exit code 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7433]: DEBUG: r1:
Command output: Connected
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7434]: DEBUG: r1
status: Secondary/Secondary local: Secondary remote: Secondary
connection: Connected
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7435]: DEBUG: DK
before crm_master -v 75
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7436]: DEBUG: r1:
Calling /usr/sbin/crm_master -v 75
########### notice: +20s
Jul 4 14:46:19 ACD-xen03 lrmd: [7073]: WARN: on_op_timeout_expired:
TIMEOUT: operation notify[15] on ocf::drbd_master_slave::r1:1 for client
7076, its parameters: CRM_meta_op_target_rc=[7]
CRM_meta_notify_operation=[start] CRM_meta_notify_start_resource=[r1:0
r1:1 ] drbd_resource=[r1] CRM_meta_master_max=[1] CRM_meta_timeout=[200.
Jul 4 14:46:19 ACD-xen03 crmd: [7076]: ERROR: process_lrm_event: LRM
operation r1:1_notify_0 (15) Timed Out (timeout=20000ms)
Jul 4 14:46:19 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update
(client: 7076, call:48): 0.1.135 -> 0.1.136 (ok)
Jul 4 14:46:19 ACD-xen03 tengine: [7084]: info: te_update_diff:
Processing diff (cib_update): 0.1.135 -> 0.1.136
Jul 4 14:46:19 ACD-xen03 tengine: [7084]: info: match_graph_event:
Action r1:1_post_notify_start_0 (79) confirmed on
d4506030-b86e-4877-9984-72b7b39e29ca
Jul 4 14:46:19 ACD-xen03 cib: [7439]: info: write_cib_contents: Wrote
version 0.1.136 of the CIB to disk (digest:
ba84a2cd700f604ea7aee326cc06e1b6)
Jul 4 14:46:20 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update
(client: 7087, call:36): 0.1.136 -> 0.1.137 (ok)
Jul 4 14:46:20 ACD-xen03 tengine: [7084]: info: te_update_diff:
Processing diff (cib_update): 0.1.136 -> 0.1.137
Jul 4 14:46:20 ACD-xen03 tengine: [7084]: info: match_graph_event:
Action r1:0_post_notify_start_0 (76) confirmed on
f6ffbaa8-9c5b-4da1-9e93-b50d227ba805
Jul 4 14:46:20 ACD-xen03 crmd: [7076]: info: do_state_transition:
acd-xen03: State transition S_TRANSITION_ENGINE -> S_IDLE [
input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
Have you tried stracing the crm_master
process?
No.
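Assuming the hang is reproducible, a minimal way to do that would be attaching to the process while the notify action is stuck (the pgrep pattern is a guess; any way of finding the PID works):

```shell
# While the notify action hangs, attach to the newest crm_master
# and watch which syscall it blocks in:
strace -f -tt -p "$(pgrep -n crm_master)"
```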
I recall that I had some issues with drbd complaining about resources
that mentioned nodes which weren't local; I worked around that by
splitting drbd.conf into several parts and giving each drbd resource its
own separate config file via the drbdconf attribute.
Don't know if that would help.
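A sketch of that workaround (path and resource body are illustrative): each resource gets its own file, and the corresponding CRM resource points at it through the agent's drbdconf parameter mentioned above.

```
# /etc/drbd-r0.conf -- only resource r0 lives in this file
resource r0 {
  # ... the same stanza that used to sit in /etc/drbd.conf ...
}
```

In the CIB this would pair with an instance attribute like drbdconf=/etc/drbd-r0.conf on ms-r0, so each drbdadm invocation only ever sees its own resource.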
Well, here's hoping that this change of yours truly is the only one
needed to fully support drbd8 ;-)
Well, the drbdadm commands issued from the script seem to be the same. As
you read earlier, I added some more status strings to look out for,
but you are right: I do not know for sure whether this is all that needs
to be changed.
Please note that this behaviour is not dependent on my r0 or r1
resource. If I start out with r0, r0 works and r1 faults. If I start the
other way around with r1, then r0 will fault.
Maybe you can still help me figure this out.
Regards
Dominik
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems