List,

I've got a dev cluster up and running with Xen/DRBD/heartbeat. After a
day or so of uptime, I saw that stonith had failed to start on node2 (it
initially started just fine). I have seen this behavior on this cluster
before.

What would cause the stonith 'start' operation to fail after it had
initially succeeded?
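
One thing I plan to check is the ssh path itself: as I understand it, the
external/ssh plugin's monitor and start both depend on passwordless root
ssh to the peer node, so a stale host key or a lost authorized key would
break both operations at once. A quick test from node2 (assuming node1 is
the peer it has to fence):

  # run as root on node2; BatchMode makes ssh exit non-zero instead of prompting
  ssh -o BatchMode=yes root@node1 true

A non-zero exit here would point at the ssh setup rather than at heartbeat.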


crm_mon output:
---------------------------
Refresh in 10s...

============
Last updated: Wed Aug 19 06:33:12 2009
Current DC: node1 (47d563cc-f8ec-4b6d-8092-d80ceb64dbbd)
2 Nodes configured.
4 Resources configured.
============

Node: node2 (c95ba6f0-5dcf-41d3-abb0-25e55ae313eb): online
Node: node1 (47d563cc-f8ec-4b6d-8092-d80ceb64dbbd): online

xen1 (heartbeat::ocf:Xen):   Started node2
xen2 (heartbeat::ocf:Xen):   Started node1
xen3 (heartbeat::ocf:Xen):   Started node2
Clone Set: Stonith_Clone_Group
    stonithclone:0      (stonith:external/ssh): Started node1
    stonithclone:1      (stonith:external/ssh): Stopped

Failed actions:
    stonithclone:1_start_0 (node=node2, call=14, rc=1): complete


At first glance, it appears that the monitor operation fails first (rc=14 in
the log below); heartbeat then tries to recover stonith on node2, and the
'start' operation fails as well.
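
Next time it fails I want to exercise the plugin by hand on node2, outside
the cluster (a sketch; I'm assuming the clone carries hostlist="node1",
adjust to whatever the CIB actually sets):

  # -S runs the device status check, -l lists the hosts the device can fence
  stonith -t external/ssh hostlist="node1" -lS

If I've understood stonithd correctly, the status check is essentially what
the recurring monitor calls, so this should reproduce the rc=14 on demand.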

Aug 18 11:02:37 node1 tengine: [3950]: WARN: update_failcount: Updating failcount for stonithclone:1 on c95ba6f0-5dcf-41d3-abb0-25e55ae313eb after failed monitor: rc=14
Aug 18 11:02:37 node1 crmd: [3859]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
Aug 18 11:02:37 node1 crmd: [3859]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node1 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: unpack_find_resource: Internally renamed stonithclone:0 on node2 to stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen2    (heartbeat::ocf:Xen):   Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen1    (heartbeat::ocf:Xen):   Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen3    (heartbeat::ocf:Xen):   Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: clone_print: Clone Set: Stonith_Clone_Group
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print:     stonithclone:0  (stonith:external/ssh):   Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print:     stonithclone:1  (stonith:external/ssh):   Started node2 FAILED
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen2  (node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen1  (node1)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource xen3  (node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Leave resource stonithclone:0  (node1)
Aug 18 11:02:37 node1 pengine: [3951]: notice: NoRoleChange: Recover resource stonithclone:1  (node2)
Aug 18 11:02:37 node1 pengine: [3951]: notice: StopRsc:   node2  Stop stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: notice: StartRsc:  node2  Start stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: notice: RecurringOp: node2  stonithclone:1_monitor_5000
Aug 18 11:02:37 node1 tengine: [3950]: info: extract_event: Aborting on transient_attributes changes for c95ba6f0-5dcf-41d3-abb0-25e55ae313eb
Aug 18 11:02:37 node1 pengine: [3951]: info: process_pe_message: Transition 3: PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-31.bz2
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node1 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: determine_online_status: Node node2 is online
Aug 18 11:02:37 node1 pengine: [3951]: info: unpack_find_resource: Internally renamed stonithclone:0 on node2 to stonithclone:1
Aug 18 11:02:37 node1 pengine: [3951]: WARN: unpack_rsc_op: Processing failed op stonithclone:1_monitor_5000 on node2: Error
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen2    (heartbeat::ocf:Xen):   Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen1    (heartbeat::ocf:Xen):   Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print: xen3    (heartbeat::ocf:Xen):   Started node2
Aug 18 11:02:37 node1 pengine: [3951]: notice: clone_print: Clone Set: Stonith_Clone_Group
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print:     stonithclone:0  (stonith:external/ssh):   Started node1
Aug 18 11:02:37 node1 pengine: [3951]: notice: native_print:     stonithclone:1  (stonith:external/ssh):   Started node2 FAILED


If the node gets rebooted, it comes back with everything working as expected
for a while, and then it happens again.
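
Rather than rebooting, I suspect it's enough to clear the failed op and the
failcount so the PE retries the start (a sketch, assuming the heartbeat 2.x
command-line tools):

  # remove the failed-op history for the clone instance on node2
  crm_resource -C -r stonithclone:1 -H node2
  # delete the failcount attribute so the PE will schedule the start again
  crm_failcount -D -U node2 -r stonithclone:1

That at least avoids the reboot, though it obviously doesn't explain why the
monitor starts failing in the first place.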


Any insight would be greatly appreciated.



regards,


_Terry


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
