On 25/11/14 19:55, Daniel Dehennin wrote:
Christine Caulfield <ccaul...@redhat.com> writes:

It seems to me that fencing is failing for some reason, though I can't
tell from the logs exactly why, so you might have to investigate your
IPMI setup to see just what is happening (I'm no IPMI expert,
sorry).

Thanks for looking, but IPMI stonith actually works. On every node I
tested I ran:

     stonith_admin --reboot <node>

and it works.
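
(For what it's worth, the exit status of that command is the easiest
thing to compare against what dlm_controld reports; a rough sketch,
using one node name from this thread:

     # Ask stonith-ng to reboot the node, then print the exit status.
     # 0 means stonith-ng reported the fencing operation as successful.
     stonith_admin --reboot nebula2
     echo "stonith_admin exit status: $?"

This goes through stonith-ng directly, without dlm_controld in the
path.)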

The log files tell me this, though:

Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
1084811079 pid 7358 result 1 exit status
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
1084811079 no actor


Showing a status code '1' from dlm_stonith - the result should be 0 if
fencing completed successfully.

But 1084811080 is nebula3 and in its logs I see:

Nov 25 10:56:33 nebula3 stonith-ng[6232]:   notice: can_fence_host_with_device: 
Stonith-nebula2-IPMILAN can fence nebula2: static-list
[...]
Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: log_operation: Operation 
'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 
'Stonith-nebula2-IPMILAN' returned: 0 (OK)
Nov 25 10:56:34 nebula3 stonith-ng[6232]:    error: crm_abort: 
crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 
: Source ID 20 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]:    error: crm_abort: 
crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 
: Source ID 21 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]:   notice: remote_op_done: Operation 
reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
Nov 25 10:56:34 nebula3 crmd[6236]:   notice: tengine_stonith_notify: Peer 
nebula2 was terminated (reboot) by nebula1 for nebula1: OK 
(ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
Nov 25 10:56:34 nebula3 crmd[6236]:   notice: crm_update_peer_state: 
tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))

Which tells me that stonith-ng managed to fence the node and notified
its success.
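
A quick way to double-check that from stonith-ng's side is its fencing
history; a sketch, assuming the --history option of stonith_admin is
available in this Pacemaker version:

     # List the fencing operations stonith-ng recorded for nebula2;
     # the reboot above should appear here with an OK result.
     stonith_admin --history nebula2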

How could the “returned: 0 (OK)” become “receive 1”?

A logic issue somewhere between stonith-ng and dlm_controld?


It could be; I don't know enough about Pacemaker to be able to comment
on that, sorry. The 'no actor' message from dlm_controld worries me,
though.
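
If it helps, dlm_tool can show dlm_controld's side of it; a rough
sketch using the standard dlm debugging commands, nothing specific to
your setup:

     # Dump dlm_controld's debug buffer and pull out the fencing lines,
     # then list the lockspaces to see their current state.
     dlm_tool dump | grep -i fence
     dlm_tool ls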


Chrissie

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
