Hi,

On Mon, Jan 14, 2013 at 11:12:11PM +0100, Tobias D. Oestreicher wrote:
> Hi all,
>
> I've written a small patch for external/ipmi, so it's possible to
> configure it not to reset a node, but trigger a crashdump via NMI.
>
> If a node becomes unavailable for whatever reason, it will be fenced,
> but this makes investigating the root cause of the node's
> unavailability very difficult; if you have a crashdump you can
> reconstruct the root cause.
>
> For this I added 3 new options:
>
> crashdump -> set this to true to enable crashdump.
>
> sshcheck  -> if this is true, an ssh connection will be
>              established to $sshipaddr or, if this is not
>              set, $hostname will be used as the remote address.
> sshipaddr -> in case ssh is listening on another interface,
>              whose DNS name differs from $hostname.

ssh is used only in case sshcheck is set to true? If so, then that
should be mentioned in the description of sshipaddr. Further, the
sshcheck parameter should come first (i.e. exchange places with
sshipaddr). The test is Linux specific; that should also be noted in
the parameter description.

Please see below for notes on the code.

> Maybe it could be useful for others too.
>
> For any comments, suggestions I would be glad.
>
>
> Tobias D. Oestreicher
>
> --
> Tobias D. Oestreicher
> Linux Consultant & Trainer
> Tel.: +49-160-5329935
> Mail: [email protected]
>
> B1 Systems GmbH
> Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
> GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537

> diff -r da5832ae23dd lib/plugins/stonith/external/ipmi
> --- a/lib/plugins/stonith/external/ipmi Sun Dec 23 16:05:11 2012 +0100
> +++ b/lib/plugins/stonith/external/ipmi Mon Jan 14 22:01:57 2013 +0100
> @@ -36,7 +36,11 @@
>  POWEROFF="power off"
>  POWERON="power on"
>  STATUS="power status"
> +CRASHDUMP="chassis power diag"
> +
>  IPMITOOL=${ipmitool:-"`which ipmitool 2>/dev/null`"}
> +SYSCTL=`which sysctl 2>/dev/null`

Normally, sysctl is in the PATH. As well as ssh (SSH_BIN below).

> +SSH_OPTS="-q -o PasswordAuthentication=no -o StrictHostKeyChecking=no"

Add "-l root", just in case?

>
>  have_ipmi() {
>      test -x "${IPMITOOL}"
> @@ -138,7 +142,11 @@
>      ;;
>  reset)
>      if ipmi_is_power_on; then
> -        do_ipmi "${RESET}"
> +        if [ "${crashdump}" == "true" ]; then
> +            do_ipmi "${CRASHDUMP}"
> +        else
> +            do_ipmi "${RESET}"
> +        fi
>      else
>          do_ipmi "${POWERON}"
>      fi
> @@ -149,11 +157,40 @@
>      # the managed node. Hence, only check if we can contact the
>      # IPMI device with "power status" command, don't pay attention
>      # to whether the node is in fact powered on or off.
> +    if [ "${crashdump}" == "true" ]; then
> +        if [ "${sshcheck}" == "true" ];then

This should go to a separate function, something like
check_crashdump_eligibility or check_crashdump_setup.
Then you can do:

    if [ "${crashdump}" == "true" -a "${sshcheck}" == "true" ]; then
        check_crashdump_setup || exit
    fi

Otherwise, it'd be hard to get the meaning of this big chunk of code.

> +            if [ -z "${hostname}" -a -z "${sshipaddr}" ]; then
> +                ha_log.sh err "Neigther hostname nor sshipaddr is set, crashdump testing not possible"

+ ha_log.sh err "Neither ...

> +            elif [ -z "${sshipaddr}" ]; then
> +                REMOTESSHHOST="${hostname}"
> +            else
> +                REMOTESSHHOST="${sshipaddr}"
> +            fi
> +            SSH_BIN=`which ssh 2>/dev/null`
> +            SSH_COMMAND="${SSH_BIN} ${REMOTESSHHOST} ${SSH_OPTS}"
> +            remote_crashdump_state=`${SSH_COMMAND} "grep -c crashkernel /proc/cmdline;${SYSCTL} -n kernel.unknown_nmi_panic kernel.panic_on_unrecovered_nmi"`

What if crashkernel is set to nothing? Would crash dump work then too?

> +            if [ $? -ne 0 ];then
> +                ha_log.sh err "Not possible to connect via ssh to ${REMOTESSHHOST}"
> +                exit 1
> +            fi
> +            unknown_nmi=`echo ${remote_crashdump_state}|awk '{print $2}'`
> +            unrecovered_nmi=`echo ${remote_crashdump_state}|awk '{print $3}'`
> +            crashdump_kernel_option=`echo ${remote_crashdump_state}|awk '{print $1}'`
> +            if [ ${crashdump_kernel_option} -ne 1 ];then
> +                ha_log.sh err "Crashdump seems not to be configured on host ${REMOTESSHHOST}"
> +                exit 1
> +            fi
> +            if [ ${unknown_nmi} -eq 0 -o ${unrecovered_nmi} -eq 0 ]; then
> +                ha_log.sh err "Non Maskerable Interupts do not trigger a reset. Set \"kernel.unknown_nmi_panic\" and \"kernel.panic_on_unrecovered_nmi\" to \"1\""

Replace "Non Maskerable Interupts" with NMI. Easier to read. And
if they don't know what it is, then this message has been wasted
anyway.

Cheers,

Dejan

> +                exit 1
> +            fi
> +        fi
> +    fi
>      do_ipmi "${STATUS}"
>      exit $?
>      ;;
>  getconfignames)
> -    for i in hostname ipaddr userid passwd interface; do
> +    for i in hostname ipaddr userid passwd interface crashdump sshipaddr sshcheck; do
>          echo $i
>      done
>      exit 0
> @@ -266,6 +303,39 @@
>  </longdesc>
>  </parameter>
>
> +<parameter name="crashdump" unique="0" required="0">
> +<content type="string" default="false"/>
> +<shortdesc lang="en">
> +Trigger Crahdump
> +</shortdesc>
> +<longdesc lang="en">
> +Instead of sending a reset to the IPMI board, submit a NMI signal to trigger a crashdump.
> +
> +!!! ATTENTION USE ONLY FOR DEBUGGING PURPOSES. NMI MUST BE TESTED PRIOR TO USE !!!
> +</longdesc>
> +</parameter>
> +
> +<parameter name="sshipaddr" unique="0">
> +<content type="string" />
> +<shortdesc lang="en">
> +IP Address of the node to stonith.
> +</shortdesc>
> +<longdesc lang="en">
> +The IP address of the node to contact via ssh in case it differs from hostname to perform checks regarding crashdump and NMI configuration.
> +</longdesc>
> +</parameter>
> +
> +<parameter name="sshcheck" unique="0">
> +<content type="string" default="false"/>
> +<shortdesc lang="en">
> +Checks whether node is configured for crashdump.
> +</shortdesc>
> +<longdesc lang="en">
> +This will be done via ssh and requires a password-less ssh connection.
> +Enable Crashdump Checks.
> (true|false)
> +</longdesc>
> +</parameter>
> +
>  </parameters>
>  IPMIXML
>  exit 0
>
> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
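
To illustrate the separate-function suggestion, the check could be
sketched roughly as below. This is only a sketch and has not been run
against the plugin. The helper names pick_ssh_host,
cmdline_has_crashkernel and nmi_panic_enabled are made up for this
example. It assumes ssh and sysctl are in the PATH on both ends, and
that SSH_OPTS, hostname, sshipaddr and ha_log.sh are available as in
the patch. Requiring a non-empty crashkernel= value is an assumption
about what a usable setting looks like, prompted by the question about
an empty crashkernel.

```shell
# Hypothetical helpers; none of these names exist in the submitted patch.

# Print the host to contact via ssh: sshipaddr if set, else hostname.
# $1 = hostname, $2 = sshipaddr. Fails if both are empty.
pick_ssh_host() {
    if [ -n "$2" ]; then
        echo "$2"
    elif [ -n "$1" ]; then
        echo "$1"
    else
        return 1
    fi
}

# Succeed only if the kernel command line ($1) carries a crashkernel=
# option with a non-empty value (assumption: an empty value is unusable).
cmdline_has_crashkernel() {
    echo "$1" | grep -Eq '(^|[[:space:]])crashkernel=[^[:space:]]+'
}

# Succeed only if both NMI sysctls are 1.
# $1 = kernel.unknown_nmi_panic, $2 = kernel.panic_on_unrecovered_nmi
nmi_panic_enabled() {
    [ "$1" = "1" ] && [ "$2" = "1" ]
}

check_crashdump_setup() {
    host=$(pick_ssh_host "${hostname}" "${sshipaddr}") || {
        ha_log.sh err "Neither hostname nor sshipaddr is set, crashdump testing not possible"
        return 1
    }
    # One ssh round trip: line 1 is the kernel command line,
    # lines 2 and 3 are the two sysctl values.
    state=$(ssh ${SSH_OPTS} "${host}" \
        "cat /proc/cmdline; sysctl -n kernel.unknown_nmi_panic kernel.panic_on_unrecovered_nmi") || {
        ha_log.sh err "ssh check on ${host} failed"
        return 1
    }
    cmdline=$(echo "${state}" | sed -n 1p)
    unknown_nmi=$(echo "${state}" | sed -n 2p)
    unrecovered_nmi=$(echo "${state}" | sed -n 3p)

    if ! cmdline_has_crashkernel "${cmdline}"; then
        ha_log.sh err "Crashdump seems not to be configured on host ${host}"
        return 1
    fi
    if ! nmi_panic_enabled "${unknown_nmi}" "${unrecovered_nmi}"; then
        ha_log.sh err "NMI does not trigger a panic. Set kernel.unknown_nmi_panic and kernel.panic_on_unrecovered_nmi to 1"
        return 1
    fi
    return 0
}
```

The status case would then call "check_crashdump_setup || exit" only
when both crashdump and sshcheck are true. Keeping the parsing in
small helpers means it can be exercised without an ssh connection.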
