Christine Caulfield <ccaul...@redhat.com> writes:

> It seems to me that fencing is failing for some reason, though I can't
> tell from the logs exactly why, so you might have to investigate your
> setup for IPMI to see just what is happening (I'm no IPMI expert,
> sorry).
Thanks for looking, but IPMI stonith is actually working; on each node I
tested:

    stonith_admin --reboot <node>

and it works.

> The log files tell me this though:
>
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
> 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
> 1084811079 pid 7358 result 1 exit status
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
> 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
> 1084811079 no actor
>
> Showing a status code '1' from dlm_stonith - the result should be 0 if
> fencing completed successfully.

But 1084811080 is nebula3, and in its logs I see:

Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
[...]
Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))

Which tells me that stonith-ng managed to fence the node and notified
its success. How could the "returned: 0 (OK)" become "receive 1"? If I
read the dlm_controld lines right, the "result 1 exit status" is the
exit status of the dlm_stonith helper it forked as pid 7358, so the
helper apparently exited non-zero even though stonith-ng reported OK. A
logic issue somewhere between stonith-ng and dlm_controld?
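In case it helps someone reproduce the comparison, this is roughly how I
cross-checked the two daemons' views of the fencing event (a sketch;
option spellings may differ between pacemaker/dlm versions, and the node
names are of course specific to my cluster):

    # What pacemaker's fencer believes happened to the peer
    stonith_admin --history nebula2

    # What dlm_controld believes about the same event
    dlm_tool status
    dlm_tool dump | grep fence

If dlm_controld stays stuck waiting on the fence result, I understand
more recent dlm_tool versions also have a fence_ack subcommand to clear
the request by hand, but I have not tried it.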
Thanks.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF