Hello, I'm looking for advice on how to troubleshoot an issue I experienced on 
one of my DRBD clusters. I'm running:
- Ubuntu 11.10 (GNU/Linux 3.0.0-12-server x86_64)
- DRBD 8.3.11
- Pacemaker 1.1.5 / Heartbeat 3.0.5

Primary Node: filer2a
Secondary Node: filer2b

I'm running NFS, and the shares house my virtual machines. I noticed the VMs 
were unresponsive, so on the primary node I ran:

- 'netstat | grep nfs' and 'showmount -a', and saw that none of my KVM hosts 
were connected.
- 'ps aux | grep nfs', and saw NFS was still running.
- 'drbd-overview' and saw ' 0:r0  Connected Primary/Secondary UpToDate/UpToDate 
C r----- /mnt/drbd0 ext4 11T 1001G 9.1T 10%'
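For what it's worth, the manual checks above can be bundled into one quick pass so everything gets captured the next time this happens. This is just a sketch; each check is skipped rather than fatal if the tool isn't installed on the node:

```shell
#!/bin/sh
# Hypothetical one-shot health check bundling the manual commands above.
# Commands are run via 'sh -c' so pipelines work; a missing binary is
# reported and skipped instead of aborting the whole sweep.
ran=0
skipped=0
while read -r cmd; do
    bin=${cmd%% *}
    if command -v "$bin" >/dev/null 2>&1; then
        echo "== $cmd =="
        sh -c "$cmd"
        ran=$((ran + 1))
    else
        echo "== $cmd (not installed) =="
        skipped=$((skipped + 1))
    fi
done <<'EOF'
netstat | grep nfs
showmount -a
ps aux | grep nfs
drbd-overview
EOF
echo "checks attempted: $ran, skipped: $skipped"
```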

I went to one KVM host and could not touch a file on the NFS share, so I 
restarted NFS, and within seconds the KVM hosts reconnected to the share. To 
my surprise, I was able to log in to the once-hung VMs without any issues; I 
guess the NFS cache did its job?!
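For reference, here is roughly how the KVM hosts mount the share (server name, paths, and sizes below are illustrative, not my actual config). With the default 'hard' option, clients block and retry RPCs indefinitely instead of returning I/O errors, which would explain why the guests stalled but resumed cleanly once the server came back:

```shell
# Illustrative NFS client mount on a KVM host (names are examples).
# 'hard' (the default) makes the client retry indefinitely rather than
# fail I/O, so guest disks stall but survive an NFS server restart.
mount -t nfs -o hard,intr,rsize=32768,wsize=32768 \
    filer2:/mnt/drbd0 /var/lib/libvirt/images
```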

I've looked in my syslog and included below the timeframe of logs from when I 
believe the issue occurred, around 12:58 PM...

Can anyone help me decipher what the issue may have been? Or, to the more 
experienced DRBD admins: is there something completely obvious here? Are there 
other logs I should be looking at?

Thank you!
Basher

Syslog:

Apr 30 12:58:20 Filer2A lrmd: [1684]: WARN: res_IPaddr2_1:monitor process (PID 
18454) timed out (try 1).  Killing with signal SIGTERM (15).
Apr 30 12:58:20 Filer2A lrmd: [1684]: WARN: operation monitor[11] on 
ocf::IPaddr2::res_IPaddr2_1 for client 1687, its parameters: 
CRM_meta_start_delay=[0] crm_feature_set=[3.0.5] CRM_meta_name=[monitor] 
CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[10.254.253.205] : pid 
[18454] timed out
Apr 30 12:58:20 Filer2A crmd: [1687]: ERROR: process_lrm_event: LRM operation 
res_IPaddr2_1_monitor_10000 (11) Timed Out (timeout=20000ms)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: process_graph_event: Action 
res_IPaddr2_1_monitor_10000/44 
(2:-2;44:8:0:198b5a2e-56cd-4d8a-9c0a-694ad16b8e82) initiated by a different 
transitioner
Apr 30 12:58:21 Filer2A crmd: [1687]: info: abort_transition_graph: 
process_graph_event:467 - Triggered transition abort (complete=1, 
tag=lrm_rsc_op, id=res_IPaddr2_1_monitor_10000, 
magic=2:-2;44:8:0:198b5a2e-56cd-4d8a-9c0a-694ad16b8e82, cib=0.18.529) : Foreign 
event
Apr 30 12:58:21 Filer2A crmd: [1687]: WARN: update_failcount: Updating 
failcount for res_IPaddr2_1 on filer2a after failed monitor: rc=-2 
(update=value++, time=1335808701)
Apr 30 12:58:21 Filer2A attrd: [1686]: info: find_hash_entry: Creating hash 
entry for fail-count-res_IPaddr2_1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_local_callback: Expanded 
fail-count-res_IPaddr2_1=value++ to 1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_trigger_update: Sending 
flush op to all hosts for: fail-count-res_IPaddr2_1 (1)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: All 2 cluster 
nodes are eligible to run resources.
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke: Query 4764: 
Requesting the current CIB: S_POLICY_ENGINE
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_perform_update: Sent update 
127: fail-count-res_IPaddr2_1=1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: find_hash_entry: Creating hash 
entry for last-failure-res_IPaddr2_1
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_trigger_update: Sending 
flush op to all hosts for: last-failure-res_IPaddr2_1 (1335808701)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke_callback: Invoking the 
PE: query=4764, ref=pe_calc-dc-1335808701-4715, seq=14, quorate=1
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation 
res_drbd_1:1_monitor_0 found resource res_drbd_1:1 active on filer2b
Apr 30 12:58:21 Filer2A pengine: [5000]: WARN: unpack_rsc_op: Processing failed 
op res_IPaddr2_1_monitor_10000 on filer2a: unknown exec error (-2)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation 
res_drbd_1:0_monitor_0 found resource res_drbd_1:0 active on filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: clone_print:  Master/Slave 
Set: ms_drbd_1 [res_drbd_1]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Masters: [ 
filer2a ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Slaves: [ 
filer2b ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: 
res_Filesystem_1#011(ocf::heartbeat:Filesystem):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: 
res_IPaddr2_1#011(ocf::heartbeat:IPaddr2):#011Started filer2a FAILED
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: 
res_nfs-kernel-server_1#011(lsb:nfs-kernel-server):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: common_apply_stickiness: 
res_nfs-kernel-server_1 can fail 999997 more times on filer2a before being 
forced off
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: RecurringOp:  Start recurring 
monitor (10s) for res_IPaddr2_1 on filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_drbd_1:0#011(Master filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_drbd_1:1#011(Slave filer2b)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_Filesystem_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Recover 
res_IPaddr2_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_nfs-kernel-server_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: unpack_graph: Unpacked transition 
4619: 4 actions in 4 synapses
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_te_invoke: Processing graph 4619 
(ref=pe_calc-dc-1335808701-4715) derived from /var/lib/pengine/pe-input-4664.bz2
Apr 30 12:58:21 Filer2A crmd: [1687]: info: te_rsc_command: Initiating action 
5: stop res_IPaddr2_1_stop_0 on filer2a (local)
Apr 30 12:58:21 Filer2A lrmd: [1684]: info: cancel_op: operation monitor[11] on 
ocf::IPaddr2::res_IPaddr2_1 for client 1687, its parameters: 
CRM_meta_start_delay=[0] crm_feature_set=[3.0.5] CRM_meta_name=[monitor] 
CRM_meta_interval=[10000] CRM_meta_timeout=[20000] ip=[10.254.253.205]  
cancelled
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_lrm_rsc_op: Performing 
key=5:4619:0:600e8087-898a-424d-9199-a8dc2b8898c0 op=res_IPaddr2_1_stop_0 )
Apr 30 12:58:21 Filer2A lrmd: [1684]: info: rsc:res_IPaddr2_1:37: stop
Apr 30 12:58:21 Filer2A crmd: [1687]: info: process_lrm_event: LRM operation 
res_IPaddr2_1_monitor_10000 (call=11, status=1, cib-update=0, confirmed=true) 
Cancelled
Apr 30 12:58:21 Filer2A crmd: [1687]: info: abort_transition_graph: 
te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, 
id=status-1c5ace0e-6f0e-4b34-817c-b4ea48deef8e-fail-count-res_IPaddr2_1, 
magic=NA, cib=0.18.530) : Transient attribute: update
Apr 30 12:58:21 Filer2A crmd: [1687]: info: update_abort_priority: Abort 
priority upgraded from 0 to 1000000
Apr 30 12:58:21 Filer2A crmd: [1687]: info: update_abort_priority: Abort action 
done superceeded by restart
Apr 30 12:58:21 Filer2A attrd: [1686]: info: attrd_perform_update: Sent update 
130: last-failure-res_IPaddr2_1=1335808701
Apr 30 12:58:21 Filer2A crmd: [1687]: info: abort_transition_graph: 
te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, 
id=status-1c5ace0e-6f0e-4b34-817c-b4ea48deef8e-last-failure-res_IPaddr2_1, 
magic=NA, cib=0.18.531) : Transient attribute: update
Apr 30 12:58:21 Filer2A IPaddr2[18526]: INFO: IP status = ok, IP_CIP=
Apr 30 12:58:21 Filer2A crmd: [1687]: info: process_lrm_event: LRM operation 
res_IPaddr2_1_stop_0 (call=37, rc=0, cib-update=4765, confirmed=true) ok
Apr 30 12:58:21 Filer2A crmd: [1687]: info: match_graph_event: Action 
res_IPaddr2_1_stop_0 (5) confirmed on filer2a (rc=0)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: run_graph: 
====================================================
Apr 30 12:58:21 Filer2A crmd: [1687]: notice: run_graph: Transition 4619 
(Complete=1, Pending=0, Fired=0, Skipped=3, Incomplete=0, 
Source=/var/lib/pengine/pe-input-4664.bz2): Stopped
Apr 30 12:58:21 Filer2A crmd: [1687]: info: te_graph_trigger: Transition 4619 
is now complete
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: All 2 cluster 
nodes are eligible to run resources.
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke: Query 4766: 
Requesting the current CIB: S_POLICY_ENGINE
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_pe_invoke_callback: Invoking the 
PE: query=4766, ref=pe_calc-dc-1335808701-4717, seq=14, quorate=1
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation 
res_drbd_1:1_monitor_0 found resource res_drbd_1:1 active on filer2b
Apr 30 12:58:21 Filer2A pengine: [5000]: WARN: unpack_rsc_op: Processing failed 
op res_IPaddr2_1_monitor_10000 on filer2a: unknown exec error (-2)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: unpack_rsc_op: Operation 
res_drbd_1:0_monitor_0 found resource res_drbd_1:0 active on filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: clone_print:  Master/Slave 
Set: ms_drbd_1 [res_drbd_1]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Masters: [ 
filer2a ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: short_print:      Slaves: [ 
filer2b ]
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: 
res_Filesystem_1#011(ocf::heartbeat:Filesystem):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: 
res_IPaddr2_1#011(ocf::heartbeat:IPaddr2):#011Stopped
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: native_print: 
res_nfs-kernel-server_1#011(lsb:nfs-kernel-server):#011Started filer2a
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: common_apply_stickiness: 
res_IPaddr2_1 can fail 999999 more times on filer2a before being forced off
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: common_apply_stickiness: 
res_nfs-kernel-server_1 can fail 999997 more times on filer2a before being 
forced off
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: RecurringOp:  Start recurring 
monitor (10s) for res_IPaddr2_1 on filer2b
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_drbd_1:0#011(Master filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_drbd_1:1#011(Slave filer2b)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_Filesystem_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Start   
res_IPaddr2_1#011(filer2b)
Apr 30 12:58:21 Filer2A pengine: [5000]: notice: LogActions: Leave   
res_nfs-kernel-server_1#011(Started filer2a)
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Apr 30 12:58:21 Filer2A crmd: [1687]: info: unpack_graph: Unpacked transition 
4620: 2 actions in 2 synapses
Apr 30 12:58:21 Filer2A crmd: [1687]: info: do_te_invoke: Processing graph 4620 
(ref=pe_calc-dc-1335808701-4717) derived from /var/lib/pengine/pe-input-4665.bz2
Apr 30 12:58:21 Filer2A crmd: [1687]: info: te_rsc_command: Initiating action 
41: start res_IPaddr2_1_start_0 on filer2b
Apr 30 12:58:22 Filer2A crmd: [1687]: info: match_graph_event: Action 
res_IPaddr2_1_start_0 (41) confirmed on filer2b (rc=0)
Apr 30 12:58:22 Filer2A crmd: [1687]: info: te_rsc_command: Initiating action 
42: monitor res_IPaddr2_1_monitor_10000 on filer2b
Apr 30 12:58:24 Filer2A crmd: [1687]: info: match_graph_event: Action 
res_IPaddr2_1_monitor_10000 (42) confirmed on filer2b (rc=0)
Apr 30 12:58:24 Filer2A crmd: [1687]: info: run_graph: 
====================================================
Apr 30 12:58:24 Filer2A crmd: [1687]: notice: run_graph: Transition 4620 
(Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-4665.bz2): Complete
Apr 30 12:58:24 Filer2A crmd: [1687]: info: te_graph_trigger: Transition 4620 
is now complete
Apr 30 12:58:24 Filer2A crmd: [1687]: info: notify_crmd: Transition 4620 
status: done - <null>
Apr 30 12:58:24 Filer2A crmd: [1687]: info: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 30 12:58:24 Filer2A crmd: [1687]: info: do_state_transition: Starting 
PEngine Recheck Timer

_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
