List,
I've got a dev cluster up and running with Xen/DRBD/heartbeat. After a day or
so of running, I saw that stonith had failed to start on node2 (it initially
started just fine). I have seen this behavior before with this cluster.
What would cause the stonith 'start' operation to fail after it had initially
succeeded?
crm_mon output:
---------------------------
Refresh in 10s...
============
Last updated: Wed Aug 19 06:33:12 2009
Current DC: node1 (47d563cc-f8ec-4b6d-8092-d80ceb64dbbd)
2 Nodes configured.
4 Resources configured.
============
Node: node2 (c95ba6f0-5dcf-41d3-abb0-25e55ae313eb): online
Node: node1 (47d563cc-f8ec-4b6d-8092-d80ceb64dbbd): online
xen1 (heartbeat::ocf:Xen): Started node2
xen2 (heartbeat::ocf:Xen): Started node1
xen3 (heartbeat::ocf:Xen): Started node2
Clone Set: Stonith_Clone_Group
    stonithclone:0 (stonith:external/ssh): Started node1
    stonithclone:1 (stonith:external/ssh): Stopped
Failed actions:
stonithclone:1_start_0 (node=node2, call=14, rc=1): complete
At first glance, it appears that the monitor operation fails first; heartbeat
then tries to restart stonith on the failed node, and the 'start' operation
fails as well.
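For what it's worth, here is how I've been poking at it so far (a sketch assuming the heartbeat 2.x CLI tools and the external/ssh plugin's hostlist parameter; the hostlist value is just an example from this cluster):

```shell
# Show the accumulated failcount for the stonith clone instance on node2
crm_failcount -G -U node2 -r stonithclone:1

# Exercise the external/ssh plugin by hand, outside the cluster manager,
# to see whether the device itself still works:
#   -S checks the status of the stonith device
#   -l lists the hosts the device claims it can fence
stonith -t external/ssh hostlist="node1 node2" -S
stonith -t external/ssh hostlist="node1 node2" -l
```

If the manual stonith invocation fails the same way on node2 (e.g. because passwordless ssh between the nodes broke), that would explain both the failed monitor and the failed restart.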
Aug 18 11:02:37 node1 tengine: [3950]: WARN: update_failcount: Updating failcount for stonithclone:1 on c95ba6f0-5dcf-41d3-abb0-25e55ae313eb after failed monitor: rc=14
Aug 18 11:02:37 node1 crmd: [3859]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
Aug 18 11:02:37 node1 crmd: [3859]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node1 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: unpack_find_resource: Internally renamed stonithclone:0 on node2 to stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: romulus#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: remus#011(heartbeat::ocf:Xen):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: fortuna#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: clone_print: Clone Set: Stonith_Clone_Group
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:0#011(stonith:external/ssh):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:1#011(stonith:external/ssh):#011Started node2 FAILED
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen2#011(node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen1#011(node1)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen3#011(node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource stonithclone:0#011(node1)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Recover resource stonithclone:1#011(node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: StopRsc: node2#011Stop stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: notice: StartRsc: node2#011Start stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: notice: RecurringOp: node2#011 stonithclone:1_monitor_5000
Aug 18 11:02:37 node1 tengine: [3950]: info: extract_event: Aborting on transient_attributes changes for c95ba6f0-5dcf-41d3-abb0-25e55ae313eb
Aug 18 11:02:37 node1 pengine: [3951]: info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-31.bz2
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node1 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: unpack_find_resource: Internally renamed stonithclone:0 on node2 to stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen2#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen1#011(heartbeat::ocf:Xen):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen3#011(heartbeat::ocf:Xen):#011Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: clone_print: Clone Set: Stonith_Clone_Group
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:0#011(stonith:external/ssh):#011Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: stonithclone:1#011(stonith:external/ssh):#011Started node2 FAILED
If the node gets rebooted, it comes back with everything working as expected
for a while; then it happens again.
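Since a reboot wipes the node's transient attributes (including the failcount), I suspect clearing the failed state by hand would have the same effect without a reboot. A sketch, assuming the heartbeat 2.x tools and the resource/node names from the output above:

```shell
# Delete the accumulated failcount for stonithclone:1 on node2
crm_failcount -D -U node2 -r stonithclone:1

# Clean up the failed-operation history so the cluster retries the start
crm_resource -C -r stonithclone:1 -H node2
```

Of course that only resets the symptom; it wouldn't explain why the monitor starts failing in the first place.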
Any insight would be greatly appreciated.
regards,
Terry
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems