Fixed now.

By mistake I removed the property stonith-enabled=false, and therefore the 
second node kept trying to fence the node that had crashed/been rebooted. The 
result was that all resources were down, waiting until the fence operation 
returned done.
After I put the parameter back, the behavior is as expected: resources are 
started on the second node without waiting for fencing.
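
For reference, this is the property I had removed, set back with crm 
configure (false disables fencing here; with a working stonith device it 
would normally stay enabled):

    crm configure property stonith-enabled=false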

Best regards

Jozef

-----Original Message-----
From: Janec, Jozef 
Sent: Wednesday, March 21, 2012 11:47 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] inconsistence in crm_mon and crm resource show

> 
> On 2012-03-21T09:42:26, "Janec, Jozef" <jozef.ja...@hp.com> wrote:
> 
> > Node b300ple0: UNCLEAN (offline)
> >         rs_nw_dbjj7     (ocf::heartbeat:IPaddr) Started
> >         rs_nw_cijj7     (ocf::heartbeat:IPaddr) Started
> > Node b400ple0: online
> >         sbd_fense_SHARED2       (stonith:external/sbd) Started
> >
> > Inactive resources:
> >
> > rs_nw_cijj7    (ocf::heartbeat:IPaddr):        Started b300ple0
> > rs_nw_dbjj7    (ocf::heartbeat:IPaddr):        Started b300ple0
> >
> > b400ple0:(/root/home/root)(root)#crm resource show
> > rs_nw_cijj7    (ocf::heartbeat:IPaddr) Started
> > sbd_fense_SHARED2      (stonith:external/sbd) Started
> > rs_nw_dbjj7    (ocf::heartbeat:IPaddr) Started
> > b400ple0:(/root/home/root)(root)#
> >
> > b400ple0:(/root/home/root)(root)#/usr/sbin/crm_resource -W -r rs_nw_cijj7
> > resource rs_nw_cijj7 is running on: b300ple0
> > b400ple0:(/root/home/root)(root)#
> >
> > but b300ple0 is down
> 
> Resources are still considered owned because the node wasn't fenced yet.
> 

[Jozef Janec]
Yes, I can see it in the logs:

Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: ERROR: log_operation: Operation 'reboot' [3159] for host 'b300ple0' with device 'sbd_fense_SHARED2' returned: 1 (call 0 from (null))
Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: info: process_remote_stonith_exec: ExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="5cb46419-bfdb-4115-85d9-6ec447b38823" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset b300ple0 failed: b300ple0 0.05859375" src="b400ple0" seq="172" />
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: ERROR: remote_op_timeout: Action reboot (5cb46419-bfdb-4115-85d9-6ec447b38823) for b300ple0 timed out
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: info: remote_op_done: Notifing clients of 5cb46419-bfdb-4115-85d9-6ec447b38823 (reboot of b300ple0 from a8125881-30df-4bd4-a5b1-666020a29eba by (null)): 1, rc=-7
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: tengine_stonith_callback: StonithOp <remote-op state="1" st_target="b300ple0" st_op="reboot" />
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: info: stonith_notify_client: Sending st_fence-notification to client 8608/bc1b0c7d-2cec-4e96-9523-5f6c51b52508
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: tengine_stonith_callback: Stonith operation 44/15:49:0:44f2b175-7292-473a-a4e8-f9abda5b3ef6: Operation timed out (-7)
Mar 21 06:18:06 b400ple0 crmd: [8608]: ERROR: tengine_stonith_callback: Stonith of b300ple0 failed (-7)... aborting transition.
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: abort_transition_graph: tengine_stonith_callback:401 - Triggered transition abort (complete=0) : Stonith failed
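
The failing agent call can also be run by hand to see its error output 
directly; this is the same command stonith-ng reports performing in the log 
above:

    stonith -t external/sbd -T reset b300ple0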


Because I rebooted the node manually to simulate an outage, and I haven't 
started rcopenais on it yet, the sbd daemon isn't running there yet either.

b400ple0:(/var/log/ha)(root)#/usr/sbin/sbd -d /dev/mapper/SHARED1_part1 list
0       b400ple0        clear
1       b300ple0        reset   b400ple0
b400ple0:(/var/log/ha)(root)#/usr/sbin/sbd -d /dev/mapper/SHARED2_part1  list
0       b300ple0        reset   b400ple0
1       b400ple0        clear

It is waiting until sbd picks up the command and performs the reset.
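
If one does not want to wait, my understanding is that a pending slot message 
can also be cleared by hand with sbd's message subcommand (device and node 
names as in my listing above; I have not verified that this alone is enough 
for the cluster to recover):

    /usr/sbin/sbd -d /dev/mapper/SHARED2_part1 message b300ple0 clear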

The question is where the information that the resource is still up is 
stored. Is it in the lrm part? I have found that I can use crm node 
clearstate, which should set the node's state to offline and probably release 
the resources, but I want to find out exactly where it is hidden. All the 
information is, or should be, located in the CIB, and I would like to know 
exactly which part of it is responsible for this behavior, to understand it 
better.
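
One way to inspect that state directly should be to dump the status section 
of the CIB, which holds the node_state entries and the lrm history for each 
node:

    cibadmin -Q -o status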

Best regards

Jozef

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
