My drbd setup is working. I can manually set each node to be primary for
each resource (while the other is secondary, of course). When starting
heartbeat, I make sure every drbd device is either Unconfigured/down or
secondary.
Just don't start drbd at boot. If it's running anyway, heartbeat should
probe and find out that the resources are running.
Okay. As you will see, this leads into other problems (which I solved),
but it does not change my main problem.
So here's what I do:
"1:" means its done on machine 1
"2:" means its done on machine 2
#: means its done on both machines
<...> is a comment by me
#: reboot
<wait> :)
#: ls /proc/drbd
ls: cannot access /proc/drbd: No such file or directory
<translated from the German original; i.e. the module is not loaded>
<make sure we start off clean>
#: rm /var/lib/heartbeat/crm/*
#: /etc/init.d/heartbeat start
<wait again>
<crm_mon shows 2 online nodes, 0 resources>
1: cibadmin -U -x cib.xml (all target roles = stopped, no instance
attributes for any node)
<crm_mon shows 2 online nodes, DC=acd-xen03, *4* resources>
1: crm_resource -r ms-r0 -v 'started' -p target_role
1: crm_resource -r fs0 -v 'started' -p target_role
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2007-06-11 14:48:25
<this means: no resources configured>
<crm_mon shows r0 "started" for both nodes -> not good>
1: drbdadm state r0
Unknown/TOO_LARGE
<OCF script needs to be changed to recognize this (maybe new drbd8)
state after just the module being loaded>
<done>
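The change I made to the agent boils down to a small state classifier. A sketch of the idea (the function name and the coarse categories are illustrative, not the agent's actual code; only the "Unknown/TOO_LARGE" handling is the new part):

```shell
# Classify the output of `drbdadm state <res>` for the monitor action.
# drbd8 reports "Unknown/TOO_LARGE" right after the module is loaded,
# before the resource is configured, so the agent must treat that
# (like "Unconfigured") as "not running" instead of as an error.
drbd_state_class() {
  case "$1" in
    Primary/*)              echo running-master ;;
    Secondary/*)            echo running-slave ;;
    Unconfigured*|Unknown*) echo not-running ;;
    *)                      echo unexpected ;;
  esac
}
```

In the agent this would map to the monitor exit code, with the not-running case returning OCF_NOT_RUNNING rather than a failure.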
So, apart from changing the script and copying it to both machines, I
started over: from the reboot up to setting target_role=started for fs0.
<now crm_mon shows r0:0 on acd-xen03 as master>
<fs0 is mounted on acd-xen03>
<2 online nodes, *4* resources>
Now comes the strange thing:
1: crm_resource -r ms-r1 -v 'started' -p target_role
1: crm_resource -r fs1 -v 'started' -p target_role
<crm_mon shows 2 online nodes, DC still acd-xen03, but *5* resources (+1)>
<one would expect to see the same result as with r0, but:>
<crm_mon shows r1 started on both nodes, no master>
#: cat /proc/drbd
version: 8.0.3 (api:86/proto:86)
SVN Revision: 2881 build by [EMAIL PROTECTED], 2007-06-11 14:48:25
...
1: cs:Connected st:Secondary/Secondary ds:UpToDate/UpToDate C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/127 hits:0 misses:0 starving:0 dirty:0 changed:0
<confirmed, no master>
<fs1 is not mounted>
<the logs suggest running crm_verify; I stripped the datetime values for readability>
1: crm_verify -LVVVV
info: log_data_element: create_fake_resource: Orphan resource
<lrm_resource id="r1:1" type="drbd_master_slave" class="ocf" provider="dk">
info: log_data_element: create_fake_resource: Orphan resource
<lrm_rsc_op id="r1:1_monitor_0" operation="monitor"
crm-debug-origin="do_update_resource"
transition_key="7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259"
transition_magic="0:7;7:4:0e0997d6-d70e-4b9e-949b-0c98684f6259"
call_id="7" crm_feature_set="1.0.7" rc_code="7" op_status="0"
interval="0" op_digest="58850437bf287086d1b41caade76bbf1"/>
info: log_data_element: create_fake_resource: Orphan resource
</lrm_resource>
info: unpack_find_resource: Making sure orphan r1:1/r1:2 is stopped on
acd-xen01
info: unpack_find_resource: Internally renamed r1:1 on acd-xen01 to r1:2
debug: unpack_rsc_order: r0_before_fs0: ms-r0.promote after fs0.start
(symmetrical)
debug: unpack_rsc_order: r1_before_fs1: ms-r1.promote after fs1.start
(symmetrical)
debug: cib_native_signoff: Signing out of the CIB Service
<r1:2 looks suspicious - no idea where this comes from>
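If that orphan entry is just stale LRM state left over from an earlier run, cleaning it up should stop the policy engine from inventing r1:2. A sketch using the heartbeat-2.x crm_resource flags (resource and node names taken from the log above):

```shell
# Clear the stale r1:1 entry from acd-xen01's status section,
# then ask the cluster to re-probe all resources.
crm_resource -C -r r1:1 -H acd-xen01
crm_resource -P
```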
I get one drbd+fs pair running (the one using r1 in my config). But when
I try to add the other one (the one using r0 in my config), it does not
promote the master and therefore does not mount the fs. The OCF script
hangs and times out at "crm_master -v 75", and as you can see in the
nodes section of the CIB, only the master value for r1 made it into the CIB.
The process "hangs"? How so?
The notify action times out (20s).
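As a workaround for the symptom (not the cause), the notify timeout on the master/slave resource could be raised above the 20s default. A sketch against the heartbeat-2.x CIB format; the op id here is chosen for illustration:

```
<operations>
  <!-- sketch: give the notify action more than the default 20s -->
  <op id="ms-r1-op-notify" name="notify" timeout="60s"/>
</operations>
```

That only buys time for crm_master, of course; if it blocks indefinitely, the timeout just fires later.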
Here a log extract:
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7412]: DEBUG: r1
notify: post for start - counts: active 0 - starting 2 - stopping 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7413]: DEBUG: DK
drbd_start_phase_2 with param "no"
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7415]: DEBUG: r1:
Calling /sbin/drbdadm -c /etc/drbd.conf state r1
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7419]: DEBUG: r1:
Exit code 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7420]: DEBUG: r1:
Command output: Secondary/Secondary
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7428]: DEBUG: r1:
Calling /sbin/drbdadm -c /etc/drbd.conf cstate r1
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7432]: DEBUG: r1:
Exit code 0
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7433]: DEBUG: r1:
Command output: Connected
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7434]: DEBUG: r1
status: Secondary/Secondary local: Secondary remote: Secondary
connection: Connected
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7435]: DEBUG: DK
before crm_master -v 75
Jul 4 14:45:59 ACD-xen03 drbd_master_slave[7406]: [7436]: DEBUG: r1:
Calling /usr/sbin/crm_master -v 75
########### notice: +20s
Jul 4 14:46:19 ACD-xen03 lrmd: [7073]: WARN: on_op_timeout_expired:
TIMEOUT: operation notify[15] on ocf::drbd_master_slave::r1:1 for client
7076, its parameters: CRM_meta_op_target_rc=[7]
CRM_meta_notify_operation=[start] CRM_meta_notify_start_resource=[r1:0
r1:1 ] drbd_resource=[r1] CRM_meta_master_max=[1] CRM_meta_timeout=[200.
Jul 4 14:46:19 ACD-xen03 crmd: [7076]: ERROR: process_lrm_event: LRM
operation r1:1_notify_0 (15) Timed Out (timeout=20000ms)
Jul 4 14:46:19 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update
(client: 7076, call:48): 0.1.135 -> 0.1.136 (ok)
Jul 4 14:46:19 ACD-xen03 tengine: [7084]: info: te_update_diff:
Processing diff (cib_update): 0.1.135 -> 0.1.136
Jul 4 14:46:19 ACD-xen03 tengine: [7084]: info: match_graph_event:
Action r1:1_post_notify_start_0 (79) confirmed on
d4506030-b86e-4877-9984-72b7b39e29ca
Jul 4 14:46:19 ACD-xen03 cib: [7439]: info: write_cib_contents: Wrote
version 0.1.136 of the CIB to disk (digest:
ba84a2cd700f604ea7aee326cc06e1b6)
Jul 4 14:46:20 ACD-xen03 cib: [7072]: info: cib_diff_notify: Update
(client: 7087, call:36): 0.1.136 -> 0.1.137 (ok)
Jul 4 14:46:20 ACD-xen03 tengine: [7084]: info: te_update_diff:
Processing diff (cib_update): 0.1.136 -> 0.1.137
Jul 4 14:46:20 ACD-xen03 tengine: [7084]: info: match_graph_event:
Action r1:0_post_notify_start_0 (76) confirmed on
f6ffbaa8-9c5b-4da1-9e93-b50d227ba805
Jul 4 14:46:20 ACD-xen03 crmd: [7076]: info: do_state_transition:
acd-xen03: State transition S_TRANSITION_ENGINE -> S_IDLE [
input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
Have you tried stracing the crm_master
process?
No.
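Assuming the hang is reproducible, a minimal way to do that would be attaching to the process while the notify action is stuck (the pgrep pattern is a guess; any way of finding the PID works):

```shell
# While the notify action hangs, attach to the newest crm_master
# and watch which syscall it blocks in:
strace -f -tt -p "$(pgrep -n crm_master)"
```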
I recall that I had some issues with drbd complaining about resources
that mentioned nodes which weren't local; I worked around that by
splitting drbd.conf into several parts and giving each drbd resource its
own separate config file via the drbdconf attribute.
Don't know if that would help.
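A sketch of that workaround (path and resource body are illustrative): each resource gets its own file, and the corresponding CRM resource points at it through the agent's drbdconf parameter mentioned above.

```
# /etc/drbd-r0.conf -- only resource r0 lives in this file
resource r0 {
  # ... the same stanza that used to sit in /etc/drbd.conf ...
}
```

In the CIB this would pair with an instance attribute like drbdconf=/etc/drbd-r0.conf on ms-r0, so each drbdadm invocation only ever sees its own resource.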
Well, here's hoping that this change of yours truly is the only one
needed to fully support drbd8 ;-)
Well, the drbdadm commands issued from the script seem to be the same. As
you read earlier, I added some more status strings to look out for,
but you are right: I do not know for sure whether this is all that needs
to be changed.
Please note that this behaviour is not dependent on my r0 or r1
resource. If I start out with r0, r0 works and r1 faults. If I start the
other way around with r1, then r0 will fault.
Maybe you can still help me figure this out.
Regards
Dominik
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems