retitle 535130 race causes failure to kill udevd in initramfs script with set -e, cascading problems ensue
reassign 535130 udev
affects 535130 + linux-2.6
thanks

I reproduce this problem every few boots lately.  Diagnosis below, which
seems to point to udev; reassigning accordingly.  Note that this matches
the problem described in the followups from Gijs Hillenius, but not
necessarily the original report from Simon Richter, which may thus need
a separate bug report.

Package versions:
linux-image-2.6.31-trunk-amd64 version 2.6.31-1~experimental.2
udev version 146-5

Every few times I boot, X comes up but has no keyboard or mouse, forcing
me to use the power button to shut down.

After noticing some unusual boot messages that only appeared when the
problem reproduced, I started using Ctrl-S and Ctrl-Q to pause and
resume the boot output so that I could read them.

First, before the message about init starting, I saw a message from the
kill command about "no such process" for some PID.  Given that this
message appeared before init started, it likely came from the initramfs.
I looked in the initramfs and found only one place that calls the kill
command, in scripts/init-bottom/udev:

> # Stop udevd, we'll miss a few events while we run init, but we catch up
> for proc in /proc/[0-9]*; do
>     [ -x $proc/exe ] || continue
>     [ "$(readlink $proc/exe)" != /sbin/udevd ] || kill ${proc#/proc/}
> done

This script has "#!/bin/sh -e", so this failure in kill would cause it to exit
and not run the remaining commands in the file:

> # move the /dev tmpfs to the rootfs
> mount -n -o move /dev $rootmnt/dev
> 
> # create a temporary symlink to the final /dev for other initramfs scripts
> nuke /dev
> ln -s $rootmnt/dev /dev
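
The set -e behaviour described above can be checked in isolation.  The
following is only a minimal sketch, not part of the udev package: it
uses "false" as a stand-in for the readlink comparison, and a PID from
an already-reaped "sleep 0" as a stand-in for a udevd process that has
gone away.  The names stale_pid, output and status are made up for this
illustration.

```shell
#!/bin/sh
# Obtain a PID that no longer exists: start a short-lived process and reap it.
sleep 0 &
stale_pid=$!
wait "$stale_pid"

# Run the same "test || kill" pattern in a child shell started with -e.
# "false" stands in for the comparison that fails when the exe is udevd,
# so kill is the last command of the || list and its failure is fatal.
# The kill error message is suppressed here to keep the output short.
output=$(sh -e -c '
    false || kill "$1" 2>/dev/null
    echo "later commands ran"
' sh "$stale_pid") && status=0 || status=$?

echo "child exit status: $status"
echo "child output: '$output'"
```

If the child shell aborts at the failed kill, the exit status is
non-zero and the "later commands ran" line never appears, which is the
same way the mount and ln commands above would be skipped.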

Slightly later in the boot process, I saw something about mknod failing to
create /dev/ppp, due to a read-only filesystem.  That seems plausible if the
/dev tmpfs didn't get moved to the root filesystem.  I would *guess* that the
attempted creation of /dev/ppp occurs in /etc/init.d/udev, when it calls
/lib/udev/create_static_nodes, which uses /etc/udev/links.conf, which has an
entry for /dev/ppp; create_static_nodes only creates things that don't already
exist, and /dev/ppp is the first item from links.conf that doesn't already
exist in the static /dev directory on the root filesystem.  (To look underneath
udev's /dev, I used a simple program that ran unshare(CLONE_NEWNS) and a shell,
and used mount to move /dev to /mnt in that temporary mount namespace.)

Now, since /etc/init.d/udev also has "#!/bin/sh -e", the failure of
/lib/udev/create_static_nodes would cause it to exit, and thus not actually
start udev, which could easily lead to havoc such as X not seeing a keyboard
and mouse.

Consistent with this, I saw one more set of unusual messages: hwclock
complaining, twice, that it could not talk to the hardware clock.
Neither of hwclock's two init scripts should run if it detects udev, so
seeing both run suggests that udev had not started.

So, it sounds like the original failure comes from the failed kill of
udevd in an initramfs script with set -e, and everything cascades from
there.  We just need to figure out what happened when attempting to kill
udevd.  That failure seems to require a race: kill only runs after the
check that the process's exe links to /sbin/udevd, so the process must
have gone away between that check and the kill.
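
One possible way to make the loop tolerate such a race is sketched
below.  This is an assumption on my part about a workaround, not the fix
adopted by the udev maintainers, and it does not explain why the process
disappears.  The function names kill_if_alive and stop_by_exe are
hypothetical; the loop body otherwise follows the one quoted from
scripts/init-bottom/udev.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM to a PID and treat "process already
# gone" as success, so a caller running under set -e is not aborted.
kill_if_alive() {
    kill "$1" 2>/dev/null || true
}

# Same loop shape as the quoted script, with the tolerant helper swapped
# in for the bare kill.  $1 is the executable path to match.
stop_by_exe() {
    target=$1
    for proc in /proc/[0-9]*; do
        [ -x "$proc/exe" ] || continue
        [ "$(readlink "$proc/exe")" != "$target" ] || kill_if_alive "${proc#/proc/}"
    done
}
```

In the initramfs script this would be called as "stop_by_exe
/sbin/udevd".  The trade-off is that a genuine kill failure (for
example, a permission problem) would also be silently ignored.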

- Josh Triplett


