daily CVS update output
Updating src tree:
P src/external/cddl/osnet/dist/lib/libdtrace/common/dt_module.c
P src/share/man/man9/module.9
P src/share/man/man9/pci.9
P src/sys/arch/evbarm/conf/RPI
P src/sys/arch/sparc/sparc/autoconf.c
P src/sys/arch/sparc/sparc/db_disasm.c
P src/sys/arch/sparc/sparc/db_interface.c
P src/sys/arch/sparc/sparc/msiiep.c
P src/sys/arch/sparc/sparc/pmap.c
P src/sys/arch/sparc/sparc/syscall.c
P src/sys/arch/sparc/sparc/trap.c
P src/sys/arch/sparc64/sparc64/db_disasm.c
P src/sys/arch/x86/x86/cpu_ucode_intel.c
P src/sys/dev/gpio/gpiobutton.c
P src/sys/dev/sbus/stp4020.c
P src/tests/usr.bin/config/t_config.sh

Updating xsrc tree:

Killing core files:

Running the SUP scanner:
SUP Scan for current starting at Mon Oct 5 03:06:02 2015
SUP Scan for current completed at Mon Oct 5 03:06:57 2015
SUP Scan for mirror starting at Mon Oct 5 03:06:57 2015
SUP Scan for mirror completed at Mon Oct 5 03:09:44 2015

Updating release-5 src tree (netbsd-5):

Updating release-5 xsrc tree (netbsd-5):

Running the SUP scanner:
SUP Scan for release-5 starting at Mon Oct 5 03:16:09 2015
SUP Scan for release-5 completed at Mon Oct 5 03:16:17 2015

Updating release-6 src tree (netbsd-6):

Updating release-6 xsrc tree (netbsd-6):

Running the SUP scanner:
SUP Scan for release-6 starting at Mon Oct 5 03:33:18 2015
SUP Scan for release-6 completed at Mon Oct 5 03:33:29 2015

Updating file list:
-rw-rw-r-- 1 srcmastr netbsd 52957159 Oct 5 03:40 ls-lRA.gz
Re: Problems with gdb?
    Date:        Sun, 04 Oct 2015 18:03:26 +0700
    From:        Robert Elz
    Message-ID:  <10734.1443956...@andromeda.noi.kre.to>

  | Paul's problem is with the image file (the core) - or more likely, with
  | gdb (since crash(8) works).

Ignore me (aside from the part about copying /netbsd to /var/crash
being broken) I had not seen Paul's later message when I replied...

kre
Re: Killing a zombie process?
On Sun, 4 Oct 2015, Robert Elz wrote:

>     Date:        Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
>     From:        Paul Goyette
>     Message-ID:
>
>   | I'm pretty much convinced that the p_nstopchild accounting is screwed up
>   | somewhere.
>
> I think I agree.
>
>   | I'm planning on adding the following code in "optimization"
>   | in kern_exit so I can catch it as soon as it happens.
>
> Sooner, but unfortunately, most probably not soon enough.
>
> It is most likely some locking/race condition with multiple processes
> dying at the same time (approximately) that is causing some of the
> increments to be lost. Making them all use atomic ops, instead of just
> ++ might fix the problem, at the cost of never discovering where the
> issue actually occurs - there should be locks around all manipulations
> of this stuff, possibly one of them is missing or misplaced.

Yeah, I think that there's a basic accounting problem somewhere, and
with an extreme load it is more likely for the SSTOPed process to get
inserted in the p_children/p_sibling list before the SZOMB process can
get reaped. Once the SSTOPed process gets to the front of the line (with
the parent's p_nstopchild count zero), the SZOMB process won't ever get
processed.

My patch will simply validate this theory. (BTW, the patch is actually
wrong, as it would also panic in the case where the wait was for a
specific pid. I've modified it in my new kernel - not yet tested.)

> It is unlikely to be in the wait processing (at least not this one) as
> there's just one process doing the waiting, there would be no contention
> for the accesses here (it could be a combination of the two though,
> wait() happening at the same instant a process is dying).

See above.

> I'm also puzzled by your observations of forked init processes having
> exited - after rc is finished, init generally only forks when one of the
> console/terminal sessions ends, and a new getty needs to be started.
> On most modern systems, that's a very rare event - though if you use the
> console (ctl-alt-Fn or whatever it is) switching, and login and out of
> those (virtual) terminals, it would happen. Is there anything like that
> in your environment?

I do occasionally switch to another wsdisplay screen (away from the X
one), but not frequently. I definitely do a switch before I use
Ctrl/Alt/Esc to get into ddb.

I'm wondering if some (most? all?) of the SSTOPd processes I see are a
result of entering ddb and/or triggering the reboot? Doesn't ddb need
to stop whatever is running on "the other CPU cores" ?

+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
daily CVS update output
Updating src tree:
P src/libexec/lfs_cleanerd/lfs_cleanerd.c
P src/sbin/fsck_lfs/extern.h
P src/sbin/fsck_lfs/fsck.h
P src/sbin/fsck_lfs/lfs.c
P src/sbin/fsck_lfs/pass1.c
P src/sbin/fsck_lfs/pass6.c
P src/sbin/fsck_lfs/segwrite.c
P src/sbin/fsck_lfs/setup.c
P src/sys/dev/pci/pci_subr.c
P src/sys/ufs/lfs/lfs.h
P src/sys/ufs/lfs/lfs_accessors.h
P src/sys/ufs/lfs/lfs_bio.c
P src/sys/ufs/lfs/lfs_rfw.c
P src/sys/ufs/lfs/lfs_segment.c
P src/sys/ufs/lfs/lfs_subr.c
P src/usr.sbin/dumplfs/dumplfs.c

Updating xsrc tree:

Killing core files:

Running the SUP scanner:
SUP Scan for current starting at Sun Oct 4 04:09:11 2015
SUP Scan for current completed at Sun Oct 4 04:43:32 2015
SUP Scan for mirror starting at Sun Oct 4 04:43:32 2015
SUP Scan for mirror completed at Sun Oct 4 06:11:25 2015

Updating file list:
-rw-rw-r-- 1 srcmastr netbsd 53007056 Oct 4 09:24 ls-lRA.gz
Re: Killing a zombie process?
I'm pretty much convinced that the p_nstopchild accounting is screwed up
somewhere.

I'm planning on adding the following code in "optimization" in kern_exit
so I can catch it as soon as it happens. Basically, if the optimization
would cause us to stop looking for a process to report, this hack/patch
will just scan the rest of the sibling list. If it finds a zombie that
should be reported, it will panic, and I'll have pointers to both the
zombie and the process at which the optimization occurred.

Comments?

Index: kern_exit.c
===
RCS file: /cvsroot/src/sys/kern/kern_exit.c,v
retrieving revision 1.245
diff -u -p -r1.245 kern_exit.c
--- kern_exit.c 2 Oct 2015 16:54:15 - 1.245
+++ kern_exit.c 4 Oct 2015 09:15:00 -
@@ -788,6 +788,14 @@ find_stopped_child(struct proc *parent,
 			break;
 		}
 		if (parent->p_nstopchild == 0 || child->p_pid == pid) {
+/* XXX */
+			struct proc *nxtchild = child;
+			while ((nxtchild = LIST_NEXT(nxtchild, p_sibling)) != NULL)
+				if (nxtchild->p_stat == SZOMB)
+					panic("Zombie %p not reaped - "
+					    "scan stopped at proc %p",
+					    nxtchild, child);
+/* XXX */
 			child = NULL;
 			break;
 		}

+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
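[Editorial sketch] The diagnostic idea in the message above (keep walking the sibling list past the point where the early exit would stop, and flag any zombie found there) can be modeled in userspace roughly as follows. This is only an illustration under stated assumptions: struct model_proc, enum model_stat and zombie_after() are hypothetical names invented here, the list is a plain singly linked list instead of the kernel's LIST macros, and the function returns the offending entry where the patch would panic.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's process states. */
enum model_stat { MODEL_ACTIVE, MODEL_STOP, MODEL_ZOMB };

/* Hypothetical model of a process on its parent's sibling list. */
struct model_proc {
	int pid;
	enum model_stat stat;
	struct model_proc *next_sibling;	/* plays the role of p_sibling */
};

/*
 * Starting just past `stopped_at` (the child at which the early exit
 * would end the search), return the first zombie further down the
 * sibling list, or NULL if there is none.  A non-NULL result is the
 * situation the diagnostic patch is meant to catch.
 */
struct model_proc *
zombie_after(struct model_proc *stopped_at)
{
	struct model_proc *p;

	if (stopped_at == NULL)
		return NULL;

	for (p = stopped_at->next_sibling; p != NULL; p = p->next_sibling) {
		if (p->stat == MODEL_ZOMB)
			return p;
	}
	return NULL;
}
```

A caller would build a small list (for example a stopped child followed by a zombie) and check that zombie_after() on the stopped child returns the zombie. Returning a pointer instead of panicking keeps the model testable.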
Re: Problems with gdb?
    Date:        Sun, 4 Oct 2015 10:26:10 +0300
    From:        Andreas Gustafsson
    Message-ID:  <22032.54418.500901.577...@guava.gson.org>

  | Paul Goyette wrote:
  | > In attempts to debug another problem (see the thread about "killing
  | > zombies"), I've twice forced crash dumps from ddb. Once with the 'sync'
  | > command, and once with 'reboot 0x104'.
  | [...]
  | > Yet, gdb fails to process these files:
  |
  | PR 48915?

No, that one (I believe) relates to the /var/crash/netbsd.N files
being a mess - something goes horribly wrong in the way they're
created (I had observed that one as well, though I wasn't aware there
was a PR about it.)

That one is a nuisance bug, as one can always just use the /netbsd file
that had been booted when the system crashed instead.

Paul's problem is with the image file (the core) - or more likely, with
gdb (since crash(8) works).

kre
Re: Killing a zombie process?
    Date:        Sun, 4 Oct 2015 17:25:21 +0800 (PHT)
    From:        Paul Goyette
    Message-ID:

  | I'm pretty much convinced that the p_nstopchild accounting is screwed up
  | somewhere.

I think I agree.

  | I'm planning on adding the following code in "optimization"
  | in kern_exit so I can catch it as soon as it happens.

Sooner, but unfortunately, most probably not soon enough.

It is most likely some locking/race condition with multiple processes
dying at the same time (approximately) that is causing some of the
increments to be lost. Making them all use atomic ops, instead of just
++ might fix the problem, at the cost of never discovering where the
issue actually occurs - there should be locks around all manipulations
of this stuff, possibly one of them is missing or misplaced.

It is unlikely to be in the wait processing (at least not this one) as
there's just one process doing the waiting, there would be no contention
for the accesses here (it could be a combination of the two though,
wait() happening at the same instant a process is dying).

I'm also puzzled by your observations of forked init processes having
exited - after rc is finished, init generally only forks when one of the
console/terminal sessions ends, and a new getty needs to be started.
On most modern systems, that's a very rare event - though if you use the
console (ctl-alt-Fn or whatever it is) switching, and login and out of
those (virtual) terminals, it would happen. Is there anything like that
in your environment?

kre
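[Editorial sketch] The lost-increment theory above can be illustrated without any real concurrency by replaying, step by step, the interleaving in which two CPUs run a plain "++" on the same counter. The code below is a userspace illustration only: lost_update_replay() and atomic_update() are hypothetical names, and C11 <stdatomic.h> is used as a stand-in for whatever atomic primitives the kernel would actually use.

```c
#include <stdatomic.h>

/*
 * Replay, in a single thread, the interleaving in which two CPUs each
 * execute a plain "counter++" at the same time.  Each side loads the
 * old value before the other has stored, so one increment is lost.
 */
int
lost_update_replay(void)
{
	int counter = 0;

	int cpu0_tmp = counter;		/* CPU 0 loads 0 */
	int cpu1_tmp = counter;		/* CPU 1 loads 0 as well */
	counter = cpu0_tmp + 1;		/* CPU 0 stores 1 */
	counter = cpu1_tmp + 1;		/* CPU 1 stores 1, overwriting CPU 0 */

	return counter;
}

/*
 * The same two increments done as indivisible read-modify-write
 * operations.  The load and the store of one increment cannot be
 * separated by the other, so both are counted.
 */
int
atomic_update(void)
{
	atomic_int counter = 0;

	atomic_fetch_add(&counter, 1);	/* first increment */
	atomic_fetch_add(&counter, 1);	/* second increment */

	return atomic_load(&counter);
}
```

As the message notes, switching to atomic operations would hide the symptom without showing which update path is missing its lock, so a replay like this is more useful for reasoning about the failure than as a fix.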
Re: Killing a zombie process?
On Sun, 4 Oct 2015, Paul Goyette wrote:

  | 1. Is it correct for init's p_nstopchild to be zero when it has several
  | children whose p_state is SSTOP?

Depends whether those children have previously been waited for or not.
Stopped children don't go away when they're waited for, so there needs
to be something to prevent wait() returning the same stopped child over
and over again. That's p_waited ... so you need to check that value of
the stopped children, if it is 0, then something is broken. If it is 1
(for all of them) then they're irrelevant, and matter not at all.

Here's another instance of the problem. (Note that I'm limping along
with crash(8) here since gdb isn't cooperating at the moment.)

crash> show proc 1
init: pid 1 proc fe810f46ecd0 vmspace/map fe810f483e60 flags 4001
lwp 1 fe810f476a60 pcb fe810f464000
stat 2 flags 802 cpu 0 pri 43
crash> x/x 0xfe810f46ecd0+0x130
fe810f46ee00: 0                       p_nstopchild == 0
crash> x/x 0xfe810f46ecd0+0x100,2
fe810f46edd0: 7b5f5800fe80            p_children listhead

Looking at the first child...

crash> x/x 0xfe807b5f5800+0xd0
fe807b5f58d0: 4                       p_stat == SSTOP
crash>
fe807b5f58d4: 6f68                    p_pid
crash> show proc 0x6f68
init: pid 28520 proc fe807b5f5800 vmspace/map fe807e7be480 flags 0
lwp 1 fe811e636300 pcb fe81aae19000
stat 2 flags 802 cpu 3 pri 43
crash> x/x 0xfe807b5f5800+0x134
fe807b5f5934: 0                       p_waited == 0
crash> x/x 0xfe807b5f5800+0xf0,2
fe807b5f58f0: f46e520 fe81            p_sibling.le_next

So, the first child of init appears to be another instance of init, and
its state is SSTOP. It has not been waited for, yet its parent (the
"real" init, pid=1) has a zero count for p_nstopchild.

This problem is easily reproduced, but only under heavy-load conditions.
On an amd64 (CPU = Intel i5-4460 @ 3.20GHz) 7.99.21 I've been running a
'build.sh -j3 release' in parallel with a series of pkgsrc builds running
with MAKE_JOBS=3; it takes from 30 to 60 minutes of this before the
Zombie appears. (The pkgsrc builds are running in chroot created by
pkgsrc/sysutils/mksandbox.)
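
+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+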
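[Editorial sketch] The manual check carried out with crash(8) above (compare the parent's counter against the children that are stopped and not yet waited for) can be written down as a small consistency test. Everything here is a simplified userspace model: struct chk_proc, its fields and both functions are hypothetical names, and the sketch assumes only what the message states, namely that a stopped child with p_waited == 0 should be reflected in the parent's p_nstopchild. The real kernel counter may cover more states than this model does.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's process states. */
enum chk_stat { CHK_ACTIVE, CHK_STOP, CHK_ZOMB };

/* Hypothetical model of the fields inspected in the crash(8) session. */
struct chk_proc {
	int pid;
	enum chk_stat stat;		/* role of p_stat */
	int waited;			/* role of p_waited */
	int nstopchild;			/* role of p_nstopchild */
	struct chk_proc *children;	/* first child (p_children) */
	struct chk_proc *sibling;	/* next sibling (p_sibling) */
};

/* Count the children that are stopped and have not been waited for. */
int
count_unwaited_stopped(const struct chk_proc *parent)
{
	const struct chk_proc *c;
	int n = 0;

	for (c = parent->children; c != NULL; c = c->sibling) {
		if (c->stat == CHK_STOP && c->waited == 0)
			n++;
	}
	return n;
}

/* Return 1 if the parent's counter matches its child list, else 0. */
int
stopchild_count_consistent(const struct chk_proc *parent)
{
	return parent->nstopchild == count_unwaited_stopped(parent);
}
```

In the situation reported above, the parent's counter is zero while its first child is stopped with waited == 0, so a check of this shape would report an inconsistency.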
Re: Problems with gdb?
On Sun, 4 Oct 2015, Andreas Gustafsson wrote:

> Paul Goyette wrote:
>> In attempts to debug another problem (see the thread about "killing
>> zombies"), I've twice forced crash dumps from ddb. Once with the 'sync'
>> command, and once with 'reboot 0x104'.
> [...]
>> Yet, gdb fails to process these files:
>
> PR 48915?

Yup, looks like that's the one! I can process the dump file
successfully with

# gdb /netbsd.gdb
GNU gdb (GDB) 7.9.1
Copyright (C) 2015 Free Software Foundation, Inc.
Reading symbols from /netbsd.gdb...done.
(gdb) target kvm netbsd.4.core
0x801196a5 in cpu_reboot (howto=howto@entry=256, bootstr=bootstr@entry=0x0)
    at /build/netbsd-local/src/sys/arch/amd64/amd64/machdep.c:671
671 dumpsys();
(gdb)

+------------------+--------------------------+------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:      |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee.com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd.org |
+------------------+--------------------------+------------------------+
Re: Problems with gdb?
Paul Goyette wrote:
> In attempts to debug another problem (see the thread about "killing
> zombies"), I've twice forced crash dumps from ddb. Once with the 'sync'
> command, and once with 'reboot 0x104'.
[...]
> Yet, gdb fails to process these files:

PR 48915?
-- 
Andreas Gustafsson, g...@gson.org
Re: Killing a zombie process?
    Date:        Sun, 4 Oct 2015 20:52:43 +0800 (PHT)
    From:        Paul Goyette
    Message-ID:

  | I do occasionally switch to another wsdisplay screen (away from the X
  | one), but not frequently. I definitely do a switch before I use
  | Ctrl/Alt/Esc to get into ddb.

OK, that could explain the forked init.

  | I'm wondering if some (most? all?) of the SSTOPd processes I see are a
  | result of entering ddb and/or triggering the reboot? Doesn't ddb need
  | to stop whatever is running on "the other CPU cores" ?

No, not that kind of stop.

kre

ps: you might want to try fixing PR 50298 (that I just submitted) and
see if that makes a difference - I think the chances are about one in
infinity, but ...
Re: pkgsrc-2015Q3 released
On Wed 30 Sep 2015 at 10:29:16 -0400, Greg Troxel wrote:
> Basically yes. However, you may want to do a final update of the tree
> from sourceforge and verify you have no uncommitted changes that you
> want to keep. (If so, you will have to manage them manually.)

which currently gives errors about "cannot close CVS/Entries" and "No
space left on device"... precisely the sort of reasons we moved away
from there of course.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl    -- 'this bath is too hot.'