Tadashiro Yoshida wrote:
Hi,

We detected some errors in the ComponentFail test while running CTS with a dev 
version. It might be a problem with how CTS handles messages.
Please check whether something is wrong.

Dev version: 76c25be5c854
# python CTSlab.py -v2 -r -c \
  --facility local7 -L /var/log/ha-log-local7 500 2>&1 | tee cts.log

-----------
Nov 26 19:22:09 x3650a CTS: Running test ComponentFail (x3650b) [16]
Nov 26 19:22:10 x3650b heartbeat: [27967]: WARN:
                Managed /usr/lib64/heartbeat/stonithd process 27980
                killed by signal 9 [SIGKILL - Kill, unblockable].
Nov 26 19:22:10 x3650b heartbeat: [27967]: ERROR:
                Respawning client "/usr/lib64/heartbeat/stonithd":
Nov 26 19:22:10 x3650b heartbeat: [27967]: info:
                Starting child client "/usr/lib64/heartbeat/stonithd"(0,0)
Nov 26 19:22:10 x3650b stonithd: [30753]: notice:
                /usr/lib64/heartbeat/stonithd start up successfully.
   :
Nov 26 19:32:41 x3650a CTS: Patterns not found:
                ['x3650c crmd:.*LOST:.* x3650b ',
                 'Updating node state to member for x3650b']
Nov 26 19:32:41 x3650a CTS: Test ComponentFail failed
                [reason:Didn't find all expected patterns]
Nov 26 19:32:41 x3650a CTS: Test ComponentFail (x3650b) [FAILED]
-----------

I think this was a pattern problem in the messages-to-ignore, which I believe is now fixed.
http://hg.linux-ha.org/dev/rev/e4a4c6fd5649
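
To illustrate the kind of check involved, here is a minimal sketch in Python. It is not the CTS code and not the content of the changeset above. The function names, the "bad" patterns, and the ignore pattern are all hypothetical. The expected patterns are the ones from the log excerpt, and the sample lines are the wrapped log lines rejoined.

```python
import re

def missing_patterns(log_lines, expected):
    """Return the expected regexes that match no line in log_lines."""
    return [p for p in expected
            if not any(re.search(p, line) for line in log_lines)]

def bad_news(log_lines, bad, ignore):
    """Return lines that match a 'bad' regex and no 'ignore' regex."""
    return [line for line in log_lines
            if any(re.search(p, line) for p in bad)
            and not any(re.search(p, line) for p in ignore)]

# Log lines taken from the excerpt above, rejoined onto single lines.
sample_log = [
    'Nov 26 19:22:10 x3650b heartbeat: [27967]: ERROR: '
    'Respawning client "/usr/lib64/heartbeat/stonithd":',
    'Nov 26 19:22:10 x3650b stonithd: [30753]: notice: '
    '/usr/lib64/heartbeat/stonithd start up successfully.',
]

# Patterns the test reported as not found.
expected = [
    'x3650c crmd:.*LOST:.* x3650b ',
    'Updating node state to member for x3650b',
]

# Hypothetical ignore entry: the respawn ERROR is expected when the
# test itself kills stonithd, so it should not count as bad news.
ignore = [r'Respawning client ".*/stonithd"']

not_found = missing_patterns(sample_log, expected)
flagged = bad_news(sample_log, [r'ERROR:', r'CRIT:'], ignore)
```

If an ignore pattern is slightly wrong, an expected message such as the respawn ERROR ends up flagged even though the test caused it on purpose.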


Besides, it seems there are some failures in the stonithd testing, although the final message says it succeeded.
-----------
Nov 26 19:54:11 x3650a CTS: BadNews: Nov 26 19:46:26 x3650b stonithd: [26162]:
  CRIT: command ssh -q -x -n -l root "x3650c" "echo 'sleep 2;
  /sbin/reboot -nf' | SHELL=/bin/sh at now >/dev/null 2>&1" failed
Nov 26 19:54:11 x3650a CTS: BadNews: Nov 26 19:46:53 x3650b stonithd: [23258]: ERROR: Failed to STONITH the node x3650c: optype=RESET,
  op_result=TIMEOUT
Nov 26 19:54:11 x3650a CTS: BadNews: Nov 26 19:46:53 x3650b tengine: [26116]:
  ERROR: tengine_stonith_callback: Stonith of x3650c failed (2)...
  aborting transition.
-----------

I need to change my testing setup to look at this. I'd heard a rumor that this was happening, but it wasn't happening to me, and no Bugzilla entry was filed. Still, I'm pretty sure it's an indication of a fault in the stonith ssh module.

I changed the code to fail-fast, which is vastly safer when you don't have STONITH available, and not harmful when you have real STONITH. However, if the ssh STONITH module can't connect to the machine it will show a failure like this one. So, I think the thing to do is to figure out how to report success in this case - in the testing STONITH module.
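
As a rough sketch of that idea, in Python for illustration only: the real ssh STONITH module is not written this way, and reset_node, can_connect, run_reboot, and the testing flag are hypothetical names. The point is only that an unreachable target is reported as success in a test-only module, while the default behavior still reports failure.

```python
def reset_node(host, can_connect, run_reboot, testing=False):
    """Return True if the reset should be reported as successful.

    can_connect and run_reboot are hypothetical callables that take a
    host name and return a bool. testing=True marks a test-only module.
    """
    if not can_connect(host):
        # Target unreachable. A test-only module may assume the node is
        # already down and report success. Anything used for real
        # fencing must report failure here.
        return testing
    # Target reachable: the result is whatever the reboot command reports.
    return run_reboot(host)

# Unreachable node, test-only module: reported as success.
test_result = reset_node("x3650c", lambda h: False, lambda h: True,
                         testing=True)
# Unreachable node, default behavior: reported as failure.
default_result = reset_node("x3650c", lambda h: False, lambda h: True)
```

Keeping the flag off by default means the success-on-unreachable behavior can only be reached from the testing module.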


--
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
