[Linux-HA] RA heartbeat/exportfs hangs sporadically

Roman Haefeli Fri, 08 Mar 2013 02:56:36 -0800

Hi all

In a two-node cluster that manages an NFS server on top of a DRBD
replication, we experienced fencing of the active node due to exportfs
monitor failure:


Mar  8 03:08:42 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:monitor 
process (PID 4120) timed out (try 1).  Killing with signal SIGTERM (15).
Mar  8 03:08:47 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:monitor 
process (PID 4120) timed out (try 2).  Killing with signal SIGKILL (9).
Mar  8 03:08:52 vicestore1 lrmd: [1550]: ERROR: TrackedProcTimeoutFunction: 
p_exportfs_virtual:monitor process (PID 4120) will not die!
Mar  8 03:08:53 vicestore1 lrmd: [1550]: WARN: operation monitor[47] on 
p_exportfs_virtual for client 1553: pid 4120 timed out
Mar  8 03:08:53 vicestore1 crmd: [1553]: ERROR: process_lrm_event: LRM 
operation p_exportfs_virtual_monitor_5000 (47) Timed Out (timeout=120000ms)

Later on, the cluster tries to stop this specific RA and this also
triggers a timeout:

Mar  8 03:10:49 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop process 
(PID 5528) timed out (try 1).  Killing with signal SIGTERM (15).
Mar  8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop process 
(PID 5528) timed out (try 2).  Killing with signal SIGKILL (9).
Mar  8 03:10:59 vicestore1 lrmd: [1550]: ERROR: TrackedProcTimeoutFunction: 
p_exportfs_virtual:stop process (PID 5528) will not die!
Mar  8 03:11:29 vicestore1 crmd: [1553]: WARN: action_timer_callback: Timer 
popped (timeout=20000, abort_level=0, complete=false)
Mar  8 03:11:29 vicestore1 crmd: [1553]: ERROR: print_elem: Aborting 
transition, action lost: [Action 7]: Completed (id: p_drbd_nfs:1_monitor_15000, 
loc: vicestore1, priority: 0)
Mar  8 03:11:29 vicestore1 crmd: [1553]: info: abort_transition_graph: 
action_timer_callback:535 - Triggered transition abort (complete=0) : Action 
lost
Mar  8 03:11:47 vicestore1 lrmd: [1550]: WARN: operation stop[50] on 
p_exportfs_v       

As you can see, the last message is cut in the middle, because the node
got fenced at that exact time. This happens quite rarely, read every few
months.

Googling "TrackedProcTimeoutFunction exportfs" didn't reveal any
results, which makes me think we are alone with this specific problem.
Is it the RA that hangs or the command 'exportfs' which is executed by
this RA? 

This on Debian Squeeze with pacemaker 1.1.7.

Any hints are welcome

Roman

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] RA heartbeat/exportfs hangs sporadically

Reply via email to