Hi all,

In a two-node cluster that manages an NFS server on top of DRBD replication, we experienced fencing of the active node after an exportfs monitor failure:

Mar 8 03:08:42 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:monitor process (PID 4120) timed out (try 1). Killing with signal SIGTERM (15).
Mar 8 03:08:47 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:monitor process (PID 4120) timed out (try 2). Killing with signal SIGKILL (9).
Mar 8 03:08:52 vicestore1 lrmd: [1550]: ERROR: TrackedProcTimeoutFunction: p_exportfs_virtual:monitor process (PID 4120) will not die!
Mar 8 03:08:53 vicestore1 lrmd: [1550]: WARN: operation monitor[47] on p_exportfs_virtual for client 1553: pid 4120 timed out
Mar 8 03:08:53 vicestore1 crmd: [1553]: ERROR: process_lrm_event: LRM operation p_exportfs_virtual_monitor_5000 (47) Timed Out (timeout=120000ms)

Later on, the cluster tries to stop this specific RA, and this also triggers a timeout:

Mar 8 03:10:49 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop process (PID 5528) timed out (try 1). Killing with signal SIGTERM (15).
Mar 8 03:10:54 vicestore1 lrmd: [1550]: WARN: p_exportfs_virtual:stop process (PID 5528) timed out (try 2). Killing with signal SIGKILL (9).
Mar 8 03:10:59 vicestore1 lrmd: [1550]: ERROR: TrackedProcTimeoutFunction: p_exportfs_virtual:stop process (PID 5528) will not die!
Mar 8 03:11:29 vicestore1 crmd: [1553]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=false)
Mar 8 03:11:29 vicestore1 crmd: [1553]: ERROR: print_elem: Aborting transition, action lost: [Action 7]: Completed (id: p_drbd_nfs:1_monitor_15000, loc: vicestore1, priority: 0)
Mar 8 03:11:29 vicestore1 crmd: [1553]: info: abort_transition_graph: action_timer_callback:535 - Triggered transition abort (complete=0) : Action lost
Mar 8 03:11:47 vicestore1 lrmd: [1550]: WARN: operation stop[50] on p_exportfs_v

As you can see, the last message is cut off in the middle, because the node got fenced at that exact moment. This happens quite rarely, roughly once every few months.
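
One way to narrow this down the next time it happens might be to record the state of the process that lrmd could not kill. The following is only a minimal sketch, assuming Python 3 is available on the node and that the syslog lines look like the ones above; the function names `stuck_operations` and `process_state` are made up for illustration and are not part of Pacemaker or the resource agents. A process that survives SIGKILL is usually in uninterruptible sleep (state "D" in /proc), which would point at a blocked kernel call rather than at the RA script logic itself.

```python
import re

# Matches the lrmd message logged when a resource agent process survives SIGKILL,
# e.g. "... p_exportfs_virtual:monitor process (PID 4120) will not die!"
STUCK = re.compile(r"(\S+):(\w+) process \(PID (\d+)\) will not die!")


def stuck_operations(log_text):
    """Return (resource, operation, pid) for every 'will not die' message."""
    return [(m.group(1), m.group(2), int(m.group(3)))
            for m in STUCK.finditer(log_text)]


def process_state(pid):
    """Return the 'State:' value from /proc/<pid>/status, or None if unavailable."""
    try:
        with open("/proc/{}/status".format(pid)) as fh:
            for line in fh:
                if line.startswith("State:"):
                    # Typical values: "S (sleeping)", "D (disk sleep)"
                    return line.split(":", 1)[1].strip()
    except OSError:
        # Process already gone, or /proc not readable on this system
        return None
    return None


if __name__ == "__main__":
    sample = ("Mar 8 03:08:52 vicestore1 lrmd: [1550]: ERROR: "
              "TrackedProcTimeoutFunction: p_exportfs_virtual:monitor "
              "process (PID 4120) will not die!")
    for resource, operation, pid in stuck_operations(sample):
        print(resource, operation, pid, process_state(pid))
```

Since the node is fenced shortly afterwards, something like this would have to run while the process is still hanging and write its result somewhere that survives the reboot.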

Googling "TrackedProcTimeoutFunction exportfs" didn't return any results, which makes me think we are alone with this specific problem. Is it the RA that hangs, or the 'exportfs' command that the RA executes?

This is on Debian Squeeze with Pacemaker 1.1.7.

Any hints are welcome.

Roman
