On Friday 08 July 2011 17:55:09 Bruce Dubbs wrote:
> I've been working on bootscripts. Basically, I'm rewriting them to get
> a better understanding. I may end up throwing them out completely but I
> want to discuss the issue of error handling.
>
> There are three bootscript files that use the
>
> read ENTER
>
> construct: checkfs, udev, and functions.
Hi Bruce,
I'll throw in my $0.02, as this read ENTER has always been a thorn in my side,
and I've been patching the lfs-bootscripts for a very long time to get around
it.
What I do: I have a dedicated partition of 100MB for storing boot failure
logs. I pass an extra argument to grub:
rescue-logs=/dev/sda6
however, with a bit of tweaking I can use rescue-logs=LABEL=rescue-logs just
as well.
Then in /etc/rc.d/init.d/functions I have this:
# This is the partition where logs are saved in case of boot failure
# this partition is never used for anything else, and is never mounted rw
RESCUE_LOGS_PARTITION=none
for i in $(cat /proc/cmdline); do
    case ${i} in
        rescue-logs=*)
            RESCUE_LOGS_PARTITION=${i#rescue-logs=}
            ;;
    esac
done
print_error_msg()
{
    echo_failure
    # $i is inherited by the rc script
    boot_mesg -n "FAILURE:\n\nYou should not be reading this error message.\n\n" ${FAILURE}
    boot_mesg -n " It means that an unforeseen error took"
    boot_mesg -n " place in ${i}, which exited with a return value of"
    boot_mesg " ${error_value}.\n"
    boot_mesg_flush
    boot_mesg -n "If you're able to track this"
    boot_mesg -n " error down to a bug in one of the files provided by"
    boot_mesg -n " the LFS book, please be so kind to inform us at"
    boot_mesg " [email protected].\n"
    boot_mesg_flush
    boot_mesg -n "\n\nWaiting ${TIMEOUT} seconds..." ${INFO}
    boot_mesg "" ${NORMAL}
    # Now try to save the error into the rescue log
    rescue_logs "Error in ${i}!!! Error value= ${error_value}"
    sleep ${TIMEOUT}
}
rescue_logs() {
    # Join all arguments into a single message string
    MESSAGE="$*"
    DATE=$(date +%Y-%m-%d-%H-%M-%S)
    LOG="/media/rescue-logs/failed-${DATE}.log"
    if [ x"${RESCUE_LOGS_PARTITION}" != x"none" ]; then
        # Redirect stdout first, then stderr, so that both are discarded
        if mount "${RESCUE_LOGS_PARTITION}" /media/rescue-logs > /dev/null 2>&1; then
            echo "=== BOOT FAILURE on ${DATE} ===" > "${LOG}"
            echo "${MESSAGE}" >> "${LOG}"
            echo "=== END OF BOOT FAILURE on ${DATE} ===" >> "${LOG}"
            echo -e "\n\n\n" >> "${LOG}"
            umount /media/rescue-logs
        fi
    fi
}
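To illustrate the LABEL= tweak mentioned above, here is a minimal sketch of the same command-line parsing wrapped in a function. The name get_rescue_logs_param is hypothetical and not part of the lfs-bootscripts; the optional argument exists only so the parsing can be tried without touching /proc/cmdline. Only the literal "rescue-logs=" prefix is stripped, so a value such as LABEL=rescue-logs comes through intact. I am assuming a mount that accepts LABEL=... as its source (util-linux mount does), in which case the value can be handed to mount unchanged.

```shell
# Hypothetical helper: print the value of the rescue-logs= parameter,
# or "none" if the parameter is absent.
# With no argument it reads /proc/cmdline; a string may be passed for testing.
get_rescue_logs_param() {
    cmdline=${1-$(cat /proc/cmdline)}
    result=none
    for word in ${cmdline}; do
        case ${word} in
            rescue-logs=*)
                # Strip only the leading "rescue-logs=", keeping any later "="
                result=${word#rescue-logs=}
                ;;
        esac
    done
    printf '%s\n' "${result}"
}
```

It could then be used as RESCUE_LOGS_PARTITION=$(get_rescue_logs_param) in place of the loop.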
And then for example in /etc/rc.d/init.d/udev I have this:
boot_mesg "Populating /dev with device nodes..."
if ! grep -q '[[:space:]]sysfs' /proc/mounts; then
    echo_failure
    boot_mesg -n "FAILURE:\n\nUnable to create" ${FAILURE}
    boot_mesg -n " devices without a SysFS filesystem"
    boot_mesg -n "\n\nAfter you press Enter, this system"
    boot_mesg -n " will be rebooted for repair." ${INFO}
    boot_mesg "" ${NORMAL}
    # Now try to save the error into the rescue log
    rescue_logs "No SysFS filesystem"
    sleep ${TIMEOUT}
    reboot -f
fi
Now, I use a grub trick that I believe Bruce posted to this mailing list a few
years ago to set a grub env variable "recordfail" to 1 upon every boot which
is then cleared in case of a normal boot. In case of a failed boot, grub
picks a second entry, a rescue-mode initrd with busybox, which tries to get a
DHCP lease or, if there is no DHCP server, to find a free IP on the same
network. I then had it email me this temporary IP, but I removed
this as I had to include my email password in the initrd. Maybe there's a way
to encrypt the password, but I didn't look hard enough. This works even for
an internet-facing machine and I've successfully tested logging into my
machine from a different location, as long as my internet connection is
working.
Finally, I just ssh to this mini os, check the rescue logs, fix the problem,
reset the default grub entry and reboot. So far it has worked for me.
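For anyone wanting to try the recordfail part, here is a rough sketch of how the flag might be cleared at the end of a successful boot. This is not my actual script: it assumes GRUB 2 with its environment block, the grub-editenv tool, and /boot/grub/grubenv as the location of the block, and the function name clear_recordfail and the GRUB_EDITENV override are made up for illustration. The grub.cfg side (setting recordfail=1 and saving it with save_env on every boot) is not shown.

```shell
# Hypothetical helper: clear the "recordfail" flag after a normal boot.
# GRUBENV and GRUB_EDITENV are assumptions; override them for your system.
clear_recordfail() {
    "${GRUB_EDITENV:-grub-editenv}" "${GRUBENV:-/boot/grub/grubenv}" unset recordfail
}
```

Something like this would be called from the last script in the boot sequence, so the flag stays set whenever the boot does not get that far.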
As a matter of fact, I created this initrd in response to this email thread:
http://linuxfromscratch.org/pipermail/lfs-dev/2004-January/041720.html
My idea was to have a "self-healing" system as much as, and if at all,
possible. An initrd which will try to fix corrupted filesystems, or at least
provide a way for you to log into the system after a failed boot and allow you
to troubleshoot and fix problems yourself. For headless/keyboardless
machines this is a good thing.
My next crazy idea is a relocatable kernel which, with some black voodoo magic
and kexec, can be loaded if a new kernel fails to load. Also an
initrd which boots either from harddisk or from a bootable cdrom/usb
thumbdrive/usb floppy/etc in case of hard disk failure. My goal is to ensure
that I can always reach my system even in case of serious problems (except of
course loss of power or internet connectivity).
I'm not saying this is the best approach, but I submit it to your attention in
case you find it, or parts of it, interesting.
>
> In the case of functions, the construct is used in print_error_msg that
> is only called from the rc script. It is not a fatal function.
>
> In checkfs, the construct is called in three different places. In two
> places it is followed immediately by a halt and one place a reboot.
>
> In udev, the construct is called in two places. In both cases, it is
> followed by a halt.
>
> The question is how to handle these errors in a headless or keyboardless
> system. The problems identified are pretty serious and it's doubtful
> anything could be written to the disk.
>
> I'm thinking about moving the messages/halt/reboot to the functions
> script so they all can be handled in one place. If we then have the
> functions script do:
>
> [ -e /etc/sysconfig/init_params ] && . /etc/sysconfig/init_params
>
> then when we want to optionally stop for the user to read something:
>
> # Wait for the user by default
> [ "${HEADLESS=0}" = "0" ] && read ENTER
I always replace the read ENTER with sleep 20 (or more if the message is
long). And I replace shutdown with reboot, which boots into rescue mode. To
me a Linux server should never make itself unavailable, either by waiting
indefinitely for user input at the console or by shutting itself down.
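The two approaches could be combined. As a minimal sketch, assuming the HEADLESS variable from your proposal and the TIMEOUT variable my patched scripts already use (the function name wait_for_user is hypothetical):

```shell
# Hypothetical helper: pause so that a message can be read.
# With HEADLESS unset or 0, wait for Enter as the stock scripts do;
# otherwise sleep for TIMEOUT seconds (default 20) and carry on.
wait_for_user() {
    if [ "${HEADLESS:-0}" = "0" ]; then
        read ENTER
    else
        sleep "${TIMEOUT:-20}"
    fi
}
```

That way a machine with a console behaves as before, while a headless one never blocks forever.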
>
> To disable the need for a keyboard entry, the /etc/sysconfig/init_params
> file would define the following:
>
> HEADLESS=1
>
> --------
>
> The above would only apply to LFS bootscripts. I can't think of
> anything from BLFS or a third party that would need to stop the boot
> sequence to wait for the user to read a message.
>
> Should we integrate this into the LFS bootscripts?
>
> -- Bruce
IvanK.