Re: Bootscripts and error handling

Ivan Kabaivanov Fri, 08 Jul 2011 23:30:34 -0700

On Friday 08 July 2011 17:55:09 Bruce Dubbs wrote:
> I've been working on bootscripts.  Basically, I'm rewriting them to get
> a better understanding.  I may end up throwing them out completely but I
> want to discuss the issue of error handling.
> 
> There are three bootscript files that use the
> 
>    read ENTER
> 
> construct:  checkfs, udev, and functions.



Hi Bruce,

I'll throw in my $0.02, as this read ENTER has always been a thorn in my side 
and I've been patching the lfs-bootscripts for a very long time to get around 
it.

What I do: I have a dedicated partition of 100MB for storing boot failure 
logs.  I pass an extra argument to grub:

rescue-logs=/dev/sda6

however, with a bit of tweaking I can use rescue-logs=LABEL=rescue-logs just 
as well.

Then in /etc/rc.d/init.d/functions I have this:

# This is the partition where logs are saved in case of boot failure
# this partition is never used for anything else, and is never mounted rw
RESCUE_LOGS_PARTITION=none
for i in $(cat /proc/cmdline); do
        case ${i} in
                rescue-logs=*)
                        RESCUE_LOGS_PARTITION=${i#rescue-logs=}
                        ;;
        esac
done


print_error_msg()
{
        echo_failure
        # $i is inherited from the rc script
        boot_mesg -n "FAILURE:\n\nYou should not be reading this error message.\n\n" ${FAILURE}
        boot_mesg -n " It means that an unforeseen error took"
        boot_mesg -n " place in ${i}, which exited with a return value of"
        boot_mesg " ${error_value}.\n"
        boot_mesg_flush
        boot_mesg -n "If you're able to track this"
        boot_mesg -n " error down to a bug in one of the files provided by"
        boot_mesg -n " the LFS book, please be so kind to inform us at"
        boot_mesg " [email protected].\n"
        boot_mesg_flush
        boot_mesg -n "\n\nWaiting ${TIMEOUT} seconds..." ${INFO}
        boot_mesg "" ${NORMAL}

        # Now try to save the error into the rescue log
        rescue_logs "Error in ${i}!!! Error value= ${error_value}"
        sleep ${TIMEOUT}

}


rescue_logs() {
        MESSAGE="$*"
        DATE=`date +%Y-%m-%d-%H-%M-%S`
        LOG="/media/rescue-logs/failed-${DATE}.log"
        if [ x"${RESCUE_LOGS_PARTITION}" != x"none" ]; then
                if mount ${RESCUE_LOGS_PARTITION} /media/rescue-logs > /dev/null 2>&1; then
                        echo "=== BOOT FAILURE on ${DATE} ===" > ${LOG}
                        echo "${MESSAGE}" >> ${LOG}
                        echo "=== END OF BOOT FAILURE on ${DATE} ===" >> ${LOG}
                        echo -e "\n\n\n" >> ${LOG}

                        umount /media/rescue-logs
                fi
        fi
}
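
The rescue-logs=LABEL=rescue-logs variant mentioned above could be handled roughly as follows. This is a minimal sketch, not what my bootscripts actually contain: parse_rescue_logs and resolve_rescue_device are hypothetical names, the kernel command line is passed in as a string so the parsing can be tried outside of a boot, and findfs from util-linux is assumed to be available that early in the boot.

```shell
#!/bin/bash
# Hypothetical helper: print the value of rescue-logs= found in the
# kernel command line string given as $1, or "none" if it is absent.
parse_rescue_logs() {
        local cmdline="$1"
        local word
        local value="none"
        # Unquoted on purpose: split the command line on whitespace
        for word in ${cmdline}; do
                case "${word}" in
                        rescue-logs=*)
                                value="${word#rescue-logs=}"
                                ;;
                esac
        done
        echo "${value}"
}

# Hypothetical helper: turn a LABEL= or UUID= spec into a device node.
# Assumes findfs (util-linux) is installed; prints "none" on failure.
resolve_rescue_device() {
        local spec="$1"
        case "${spec}" in
                LABEL=*|UUID=*)
                        findfs "${spec}" 2> /dev/null || echo "none"
                        ;;
                *)
                        echo "${spec}"
                        ;;
        esac
}
```

In the functions script this would be used as RESCUE_LOGS_PARTITION=$(resolve_rescue_device "$(parse_rescue_logs "$(cat /proc/cmdline)")"), so a spec that cannot be resolved falls back to "none" and rescue_logs silently does nothing.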


And then for example in /etc/rc.d/init.d/udev I have this:

boot_mesg "Populating /dev with device nodes..."
                if ! grep -q '[[:space:]]sysfs' /proc/mounts; then
                        echo_failure
                        boot_mesg -n "FAILURE:\n\nUnable to create" ${FAILURE}
                        boot_mesg -n " devices without a SysFS filesystem"
                        boot_mesg -n "\n\nAfter ${TIMEOUT} seconds, this system"
                        boot_mesg -n " will be rebooted for repair." ${INFO}
                        boot_mesg "" ${NORMAL}

                        # Now try to save the error into the rescue log
                        rescue_logs "No SysFS filesystem"
                        sleep ${TIMEOUT}
                        reboot -f
                fi


Now, I use a grub trick that I believe Bruce posted to this mailing list a few 
years ago to set a grub env variable "recordfail" to 1 upon every boot, which 
is then cleared in case of a normal boot.  In case of a failed boot, grub 
picks a second entry which is a rescue mode initrd with busybox which tries to 
get a DHCP lease or in case of no DHCP server, it tries to find a free IP on 
the same network.  I had it then email me this temporary IP, but I removed 
this as I had to include my email password in the initrd.  Maybe there's a way 
to encrypt the password, but I didn't look hard enough.  This works even for 
an internet-facing machine and I've successfully tested logging into my 
machine from a different location, as long as my internet connection is 
working.

Finally, I just ssh to this mini os, check the rescue logs, fix the problem, 
reset the default grub entry and reboot.  So far it has worked for me.
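
For reference, the recordfail mechanism can be sketched like this. It is an illustration under assumptions rather than my exact setup: the grub.cfg lines in the comments assume GRUB 2 with the rescue entry as menu entry 1, clear_recordfail is a hypothetical helper name, and /boot/grub/grubenv is assumed to be the environment block location.

```shell
#!/bin/bash
# Sketch only. The grub.cfg side (GRUB 2 script, shown here as comments)
# might look like:
#
#   load_env
#   if [ "${recordfail}" = 1 ]; then set default=1; else set default=0; fi
#   set recordfail=1
#   save_env recordfail
#
# The bootscript side clears the flag once the boot has succeeded, so
# the next boot picks the normal entry again.

# Hypothetical helper: clear recordfail in the GRUB environment block
# given as $1 (assumed default: /boot/grub/grubenv).
clear_recordfail() {
        local envfile="${1:-/boot/grub/grubenv}"
        # Nothing to do without an environment block or grub-editenv
        [ -f "${envfile}" ] || return 0
        command -v grub-editenv > /dev/null 2>&1 || return 0
        grub-editenv "${envfile}" unset recordfail
}
```

Calling clear_recordfail from one of the last scripts in the normal runlevel means any boot that dies earlier leaves recordfail=1 behind, and grub then selects the rescue entry.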

As a matter of fact, I created this initrd in response to this email thread:

http://linuxfromscratch.org/pipermail/lfs-dev/2004-January/041720.html

My idea was to have a "self-healing" system as much as, and if at all, 
possible.  An initrd which will try to fix corrupted filesystems, or at least 
provide a way for you to log into the system after a failed boot and allow you 
to troubleshoot and fix problems yourself.  For headless/keyboardless 
machines this is a good thing.

My next crazy idea is a relocatable kernel which with some black voodoo magic 
and kexec can be loaded in case of a new kernel failing to load.  Also an 
initrd which boots either from harddisk or from a bootable cdrom/usb 
thumbdrive/usb floppy/etc in case of hard disk failure.  My goal is to ensure 
that I can always reach my system even in case of serious problems (except of 
course loss of power or internet connectivity).

I'm not saying this is the best approach, but I submit it to your attention in 
case you find it, or parts of it, interesting.




> 
> In the case of functions, the construct is used in print_error_msg that
> is only called from the rc script.  It is not a fatal function.
> 
> In checkfs, the construct is called in three different places.  In two
> places it is followed immediately by a halt and one place a reboot.
> 
> In udev, the construct is called in two places. In both cases, it is
> followed by a halt.
> 
> The question is how to handle these errors in a headless or keyboardless
>   system.  The problems identified are pretty serious and it's doubtful
> anything could be written to the disk.
> 
> I'm thinking about moving the messages/halt/reboot to the functions
> script so they all can be handled in one place.   If we then have the
> functions script do:
> 
> [ -e /etc/sysconfig/init_params ]  && . /etc/sysconfig/init_params
> 
> then when we want to optionally stop for the user to read something:
> 
> # Wait for the user by default
> [ "${HEADLESS=0}" = "0" ] && read ENTER


I always replace the read ENTER with sleep 20 (or more if the message is 
long).  And I replace shutdown with reboot which boots into rescue mode.  To 
me a Linux server should never make itself unavailable, either by waiting 
infinitely for user input at the console or by shutting itself down.
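
The two approaches can be combined. Below is a minimal sketch, assuming bash (for read -t) and assuming HEADLESS and TIMEOUT are set in /etc/sysconfig/init_params; wait_for_user is a hypothetical name, not an existing bootscript function. On a headless box it just sleeps, otherwise it waits for Enter but gives up after the timeout, so the boot never blocks forever.

```shell
#!/bin/bash
# Hypothetical helper: pause so a message can be read, but never block
# indefinitely. HEADLESS and TIMEOUT are assumed to come from
# /etc/sysconfig/init_params; defaults are 0 and 20 seconds.
wait_for_user() {
        local timeout="${TIMEOUT:-20}"
        if [ "${HEADLESS:-0}" = "1" ]; then
                # Nobody at the console: just leave the message up a while
                sleep "${timeout}"
        else
                # read -t returns non-zero on timeout or EOF; ignore that
                read -r -t "${timeout}" ENTER || true
        fi
        return 0
}
```

With this, the callers in checkfs and udev would call wait_for_user and then reboot or halt as before, whether or not a keyboard is attached.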


> 
> To disable the need for a keyboard entry, the /etc/sysconfig/init_params
> file would define the following:
> 
> HEADLESS=1
> 
> --------
> 
> The above would only apply to LFS bootscripts.  I can't think of
> anything from BLFS or a third party that would need to stop the boot
> sequence to wait for the user to read a message.
> 
> Should we integrate this into the LFS bootscripts?
> 
>    -- Bruce


IvanK.
-- 
http://linuxfromscratch.org/mailman/listinfo/lfs-dev
FAQ: http://www.linuxfromscratch.org/faq/
Unsubscribe: See the above information page
