On Friday 21 May 2010 00:03, Paul Smith wrote: > Hi all. I've just finished tracking down a weird problem on our system. > I can fix the "top-level" issue, but I'm trying to determine whether > there are any real bugs here (besides my error) or not. > > We're using an intel-based embedded system running in a ramdisk with a > 2.6.32.13 kernel, glibc, and busybox 1.15.3. On this system we're using > busybox syslogd and klogd. We noticed that we were getting corrupted > kernel messages in the syslog file output. dmesg showed correct values > but the kernel messages kept in /var/log/messages were all messed up. > > Tracking this down it appears the cause of the problem is that there are > two klogd processes running on my system. So, my first question is, is > this expected that if two klogd are running that they would "step on" > each others' access to the kernel log ring buffer,
Unfortunately, yes. That's how it works. > or something, and > generate corrupted output? I would have naively expected each process > to maintain its own pointer and I might get duplicated messages, but not > corrupted messages. If I run two "dmesg" commands at the same time I > don't see this kind of corruption. Because dmesg does not need to sit and wait for a new protion of data. It just dumps entire buffer. This opration does not require per-process pointers to buffer. > Or, is this a bug in busybox klogd? > Or is it a bug in Linux 2.6.32.13? Neither. It worked that way for years. > Trying to determine why I have two klogd processes I see this: the > bring-up of the system is managed by busybox init. To start the system > it runs: > > ::sysinit:/etc/init.d/rcS start > ::once:/etc/rootfs start > > In the "rootfs" script is where klogd is started, then it does some > things such as change the system time, timezone, hostname, etc. The way > things work we restart the logging system multiple times as these > changes are implemented, so the log messages have the right timezone and > hostname. > > The first time I restart I kill klogd, but init (apparently) never reaps > the child so it's running as a zombie. Please show me the process *tree*. In this tree, is this zombie a child of init, or a child of some other process? If former, then it's a bug in init. Try newer busybox. If it doesn't help, let me know. > Then I start a new klogd. Then > more booting happens and I want to restart it again, but when I use > pidof to find the pid of klogd I get back TWO pids: the zombie and the > real klogd. Due to an error in my script where I don't expect to get > back >1 PID, the kill fails and I start another klogd. Now I have one > zombie and two real klogd's, so that's where this corruption comes from. Use "killall klogd", it would kill *all* klogds. > At some later point (once my "rootfs" script ends?), init will reap the > zombie klogd and I'll be left with just the two klogd processes. > > I can fix my scripting, of course, but that brings me to the next > question: is it expected/desirable that init doesn't reap children while > it's waiting for a command like this to complete? This seems like > sub-optimal behavior. init must reap zombies. But it can only reap _its children_. Example: $ sh -c 'sleep 1 & exec sleep 999' In other xterm: $ top -bn1 | grep -e PID -e sleep PID PPID USER STAT VSZ %MEM CPU %CPU COMMAND 25808 25787 root S 920 0% 2 0% grep -e PID -e sleep 25781 1224 root S 916 0% 0 0% sleep 1260 3599 1 root S 916 0% 0 0% sleep 32000 25805 25784 root S 916 0% 1 0% sleep 999 25806 25805 root Z 0 0% 1 0% [sleep] See? Process 25806 died, but its parent process 25805 did not reap it. So 25806 will stay in state Z until process 25805 reaps it, or until process 25805 exits. In latter case, 25806 will be reparented to init, and only _THEN_ init can reap it. You need to check PPID of zombies on your system. -- vda _______________________________________________ busybox mailing list [email protected] http://lists.busybox.net/mailman/listinfo/busybox
