Re: repeated failure to properly shutdown
On Jul 22, 2016, at 5:02 PM, Robert Elz wrote:

> the question is just where is that first attempt.

Hmm, it looks like doing "shutdown now" to get into single-user will force-unmount the tmpfs file systems (/etc/rc.d/swap1), so you could be left in a state where creating a regular /dev/null becomes all too easy.

As an aside, while testing out things I also noticed that MAKEDEV tries to use $MKNOD in create_mfs_dev(), but this is only set if the -m switch is supplied. So the attempt to make a temporary console fails.

Thanks,
- Michael
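The $MKNOD problem Michael mentions is the usual sh pitfall of using an unset variable as a command word. A minimal sketch of a defensive default follows; it is not taken from MAKEDEV, and the helper name `mknod_cmd` and the example override path are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper, not MAKEDEV's actual code: choose the mknod command,
# falling back to plain "mknod" when MKNOD is unset or empty.
mknod_cmd() {
    printf '%s\n' "${MKNOD:-mknod}"
}

# Unset: the fallback is used.
unset MKNOD
mknod_cmd

# Set (by assumption, as a -m style switch might do): the override wins.
MKNOD=/rescue/mknod
mknod_cmd
```

With a fallback of this kind, a code path that runs without the switch still has a usable command name instead of an empty word.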
daily CVS update output
Updating src tree: P src/etc/rc.d/mountcritlocal P src/sys/arch/amd64/amd64/machdep.c P src/sys/arch/amd64/include/pmap.h P src/sys/arch/mips/mips/bus_dma.c P src/sys/arch/x86/x86/pmap.c P src/sys/dev/ata/wd.c P src/sys/dev/ata/wdvar.h P src/sys/net/if.c P src/sys/net/if.h Updating xsrc tree: Killing core files: Updating tar files: src/top-level: collecting... replacing... done src/bin: collecting... replacing... done src/common: collecting... replacing... done src/compat: collecting... replacing... done src/crypto: collecting... replacing... done src/dist: collecting... replacing... done src/distrib: collecting... replacing... done src/doc: collecting... replacing... done src/etc: collecting... replacing... done src/external: collecting... replacing... done src/extsrc: collecting... replacing... done src/games: collecting... replacing... done src/gnu: collecting... replacing... done src/include: collecting... replacing... done src/lib: collecting... replacing... done src/libexec: collecting... replacing... done src/regress: collecting... replacing... done src/rescue: collecting... replacing... done src/sbin: collecting... replacing... done src/share: collecting... replacing... done src/sys: collecting... replacing... done src/tests: collecting... replacing... done src/tools: collecting... replacing... done src/usr.bin: collecting... replacing... done src/usr.sbin: collecting... replacing... done src/config: collecting... replacing... done src: collecting... replacing... done xsrc/top-level: collecting... replacing... done xsrc/external: collecting... replacing... done xsrc/local: collecting... replacing... done xsrc: collecting... replacing... 
done Running the SUP scanner: SUP Scan for current starting at Sat Jul 23 03:08:23 2016 SUP Scan for current completed at Sat Jul 23 03:08:38 2016 SUP Scan for mirror starting at Sat Jul 23 03:08:38 2016 SUP Scan for mirror completed at Sat Jul 23 03:10:47 2016 Updating release-6 src tree (netbsd-6): U doc/CHANGES-6.2 P libexec/mail.local/mail.local.c Updating release-6 xsrc tree (netbsd-6): Updating release-6 tar files: src/top-level: collecting... replacing... done src/bin: collecting... replacing... done src/common: collecting... replacing... done src/compat: collecting... replacing... done src/crypto: collecting... replacing... done src/dist: collecting... replacing... done src/distrib: collecting... replacing... done src/doc: collecting... replacing... done src/etc: collecting... replacing... done src/external: collecting... replacing... done src/extsrc: collecting... replacing... done src/games: collecting... replacing... done src/gnu: collecting... replacing... done src/include: collecting... replacing... done src/lib: collecting... replacing... done src/libexec: collecting... replacing... done src/regress: collecting... replacing... done src/rescue: collecting... replacing... done src/sbin: collecting... replacing... done src/share: collecting... replacing... done src/sys: collecting... replacing... done src/tests: collecting... replacing... done src/tools: collecting... replacing... done src/usr.bin: collecting... replacing... done src/usr.sbin: collecting... replacing... done src/config: collecting... replacing... done src/x11: collecting... replacing... done xsrc/top-level: collecting... replacing... done xsrc/external: collecting... replacing... done xsrc/local: collecting... replacing... done xsrc/xfree: collecting... replacing... 
done Running the SUP scanner: SUP Scan for release-6 starting at Sat Jul 23 03:15:49 2016 SUP Scan for release-6 completed at Sat Jul 23 03:15:58 2016 Updating release-7 src tree (netbsd-7): U doc/CHANGES-7.1 P libexec/mail.local/mail.local.c Updating release-7 xsrc tree (netbsd-7): Updating release-7 tar files: src/top-level: collecting... replacing... done src/bin: collecting... replacing... done src/common: collecting... replacing... done src/compat: collecting... replacing... done src/crypto: collecting... replacing... done src/dist: collecting... replacing... done src/distrib: collecting... replacing... done src/doc: collecting... replacing... done src/etc: collecting... replacing... done src/external: collecting... replacing... done src/extsrc: collecting... replacing... done src/games: collecting... replacing... done src/gnu: collecting... replacing... done src/include: collecting... replacing... done src/lib: collecting... replacing... done src/libexec: collecting... replacing... done src/regress: collecting... replacing... done src/rescue: collecting... replacing... done src/sbin: collecting... replacing... done src/share: collecting... replacing... done src/sys: collecting... replacing... done src/tests: collecting... replacing... done src/tools: collecting... replacing... done src/usr.bin: collecting... replacing... done src/usr.sbin: collecting... replacing... done src/config: collecting... replacing... done src/x11: collecting... replacing... done src: collecting... replacing... done
Re: repeated failure to properly shutdown
    Date:        Fri, 22 Jul 2016 17:09:30 -0700
    From:        bch
    Message-ID:

  | Iirc, where I *noticed* it was /etc/defaults/rc.d

Yes, that (/etc/defaults/rc.conf I assume you mean) writes to /dev/null - but init has made the tmpfs /dev (if it needs it) before it runs /etc/rc - and so before /etc/defaults/rc.conf gets used.

There is essentially nothing possible from when the system boots before the tmpfs /dev/ is made - when /dev/console is not there. MAKEDEV is just about the first thing init does in that case - which is why I initially assumed that the problem had to be there (but MAKEDEV does literally nothing to /dev/null except mknod it when needed.)

Given that, I see two possibilities, and maybe you can remember which is more likely? Either /dev/null (the file in /dev on the root filesys) got created before you initially booted the system, or it was created while you were fixing the missing /dev/console just recently.

First, at any time did you have your new root filesystem mounted somewhere, and chroot /to/it (that is, if it were on /mnt and you did "chroot /mnt")?

There was no need to do that to fix the missing /dev/console (and the missing rest of /dev) and it was not what Martin suggested you do, so I am going to guess that this did not happen today/yesterday when you were fixing things. Sound right?

So, think back to when you first built the system. Sometime then you would have needed to do some configuration - did you boot first, and then configure (stuff like the hostname, the network config, rc_configured=YES in rc.conf etc) or did you set some of that up before you booted? (It doesn't matter here if it was the very first boot or not, just if you did setup only while running the new system, or if you did some config from the system you used to run build.sh).

If you did config using the older system - how did that happen? Did you just "cd /new-root/etc; edit; edit; edit ..." or did you "chroot /new-root" ... ?

kre
Re: repeated failure to properly shutdown
    Date:        Fri, 22 Jul 2016 16:27:19 -0700
    From:        bch
    Message-ID:

  | It could be that for some reason it's missing, and a first attempt to write
  | to it just creates a regular file...

Yes, that's a given - the question is just where is that first attempt. It has to be before init makes the tmpfs /dev (the /dev/null created in the tmpfs would have been the right thing, or you would have noticed that problem much sooner) and it has to be with the live system as root (not when it is mounted on /mnt) as nothing is likely to accidentally write to /mnt/dev/null ...

A chroot to /mnt might do it I suppose...

Michael Plass said:

  | Could it perhaps come from the ( set -o tabcomplete 2>/dev/null ) in
  | /etc/shrc?

Not in normal operation, /etc/shrc wouldn't normally be able to run until way after init has created the tmpfs /dev - init doesn't set ENV, so the sh it runs to execute MAKEDEV wouldn't source shrc - ENV set to /etc/shrc normally comes from /root/.profile which would be used only when root logs in.

kre
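The mechanism being discussed (a redirect to a missing /dev/null silently creating a regular file) can be reproduced safely in a scratch directory. This is only an illustrative sketch: `node_kind` and `fake_dev` are made-up names, and nothing here touches the real /dev.

```shell
#!/bin/sh
# Classify a path for the cases that matter in this thread.
node_kind() {
    if [ -c "$1" ]; then
        echo "character device"
    elif [ -f "$1" ]; then
        echo "regular file"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Stand-in for an unpopulated /dev directory.
fake_dev=$(mktemp -d)

node_kind "$fake_dev/null"               # the node does not exist yet
( echo discarded > "$fake_dev/null" )    # any redirect creates it ...
node_kind "$fake_dev/null"               # ... as a regular file

rm -rf "$fake_dev"
```

On a running system, `node_kind /dev/null` (or simply `ls -l /dev/null`) is a quick way to see whether the node underneath is still a character device.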
Re: repeated failure to properly shutdown
On Jul 22, 2016, at 4:24 PM, Robert Elz wrote:

>    Date:        Sat, 23 Jul 2016 04:38:42 +0700
>    From:        Robert Elz
>    Message-ID:  <20406.1469223...@andromeda.noi.kre.to>
>
>  | That /dev/null turned into a regular file is another bug [...]
>  | (This turns out to be a bug in MAKEDEV [...]
>
> Actually, not, it must be elsewhere, or as a result of something else.
>
> kre

Could it perhaps come from the ( set -o tabcomplete 2>/dev/null ) in /etc/shrc?

- Michael
Re: repeated failure to properly shutdown
    Date:        Sat, 23 Jul 2016 04:38:42 +0700
    From:        Robert Elz
    Message-ID:  <20406.1469223...@andromeda.noi.kre.to>

  | That /dev/null turned into a regular file is another bug [...]
  | (This turns out to be a bug in MAKEDEV [...]

Actually, not, it must be elsewhere, or as a result of something else.

kre
Re: repeated failure to properly shutdown
On Sat 23 Jul 2016 at 04:38:42 +0700, Robert Elz wrote:
> That /dev/null turned into a regular file is another bug - it is being
> created before the tmpfs /dev is made, I have seen that before as well,
> but just corrected and ignored the problem until now.

Similarly, I noticed that if /var is a tmpfs (or any initially empty directory really), then /etc/rc.d/mountcritlocal fails because it wants to cd to /var/run and that has not been created (if that ever happens).

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- Wayland: Those who don't understand X
\X/ rhialto/at/xs4all.nl    -- are condemned to reinvent it. Poorly.
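One way to avoid the failure Olaf describes is to create the directory before changing into it. The sketch below is an assumption about how such a guard could look, not the actual contents of /etc/rc.d/mountcritlocal; `enter_dir` is a hypothetical helper and the scratch directory stands in for an empty /var.

```shell
#!/bin/sh
# Hypothetical helper, not the actual rc.d code: make sure the directory
# exists before changing into it.
enter_dir() {
    mkdir -p "$1" && cd "$1"
}

# Demonstrate on a scratch directory standing in for an empty /var.
scratch=$(mktemp -d)
(
    enter_dir "$scratch/run" && echo "now in: $(basename "$PWD")"
)
rm -rf "$scratch"
```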
Re: repeated failure to properly shutdown
    Date:        Fri, 22 Jul 2016 12:52:58 -0700
    From:        bch
    Message-ID:

  | I think that biggest concern (unclean shutdown/reboot) is solved (collision
  | of /dev and a tmpfs mount, caused by default behavior of init in face of
  | missing /dev/console).

Yes, and now we know what the cause is, we should be able to duplicate the problem, and work out what is really happening. The system is supposed to work with a tmpfs /dev, it should not panic during shutdown.

What's more, this panic is probably not related to it being /dev - any tmpfs mounted with -o union on a mount point that is using WAPBL (-o log) will probably panic the same way.

  | This disk was prepared remotely (I.e. from another running NetBSD box) by
  | partitioning the disk (disklabel), formatting (newfs), then mounting all
  | partitions appropriately under /mnt and running ./build.sh ... install=/mnt

That builds enough for the system to install, but it does not make a fully runnable system - there's more that sysinst normally does, like populating /dev, but also making a basic rc.conf (including all network config, setting hostname) and fstab, which don't get built by build.sh either (and nor should they). Those you must have done later. Running MAKEDEV is just another of the steps that one needs to perform.

That /dev/null turned into a regular file is another bug - it is being created before the tmpfs /dev is made, I have seen that before as well, but just corrected and ignored the problem until now. (This turns out to be a bug in MAKEDEV which is run by init to make the tmpfs /dev when /dev/console is not present.)

Your solution to that was correct. (MAKEDEV did not fix it as it never replaces anything that already exists, only makes what does not - even if what exists is nonsense, and even if it created the nonsense itself.)

It is good that your problems are overcome now - and thanks for bringing it to our attention, and for being willing to suffer through getting enough info to allow the problem to be better understood.

kre
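The "only makes what does not exist" behaviour explains why a stray regular file survives a later MAKEDEV run. The following is a hedged model of that behaviour, not MAKEDEV's code; `plan_node` and `is_bogus_chardev` are hypothetical names, and a scratch directory stands in for /dev.

```shell
#!/bin/sh
# Hypothetical model of a "create only what is missing" pass.
plan_node() {
    if [ -e "$1" ]; then
        echo "skip"        # something is there, right type or not
    else
        echo "create"
    fi
}

# A stricter check a repair step could use instead: succeeds only when
# the path exists but is not a character device.
is_bogus_chardev() {
    [ -e "$1" ] && [ ! -c "$1" ]
}

fake_dev=$(mktemp -d)          # stand-in directory, not the real /dev
: > "$fake_dev/null"           # the stray regular file
plan_node "$fake_dev/null"     # the bogus file is left alone
if is_bogus_chardev "$fake_dev/null"; then
    echo "needs manual removal"
fi
rm -rf "$fake_dev"
```

The design point is that an existence test alone cannot tell a correct node from a wrong one; a type test is needed before deciding to leave it in place.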
Re: repeated failure to properly shutdown
On Fri, Jul 22, 2016 at 11:37:56AM -0700, bch wrote:
> How does that happen, how does one fix it ?

It is created by init if there is no /dev/console.

Boot some install media, mount your root file system (say on /mnt) then:

	cd /mnt/dev
	sh MAKEDEV all

(hoping there is a MAKEDEV script there, if not: extract it from etc.tgz from the install sets)

Then reboot and check mount again.

Martin
Re: repeated failure to properly shutdown
Wow -- there -is- a tmpfs on /dev

kamloops# mount
/dev/wd0a on / type ffs (log, local)
-> tmpfs on /dev type tmpfs (union, local)
/dev/wd0e on /var type ffs (log, local)
/dev/wd0f on /usr type ffs (log, local)
/dev/wd0g on /home type ffs (log, local)
kernfs on /kern type kernfs (local)
ptyfs on /dev/pts type ptyfs (local)
procfs on /proc type procfs (local)
tmpfs on /var/shm type tmpfs (local)

But no entry for it in fstab...

# NetBSD /etc/fstab
# See /usr/share/examples/fstab/ for more examples.
/dev/wd0a   /          ffs      rw,log      1 1
/dev/wd0b   none       swap     sw,dp       0 0
/dev/wd0f   /usr       ffs      rw,log      1 2
/dev/wd0e   /var       ffs      rw,log      1 2
/dev/wd0g   /home      ffs      rw,log      1 2
kernfs      /kern      kernfs   rw
ptyfs       /dev/pts   ptyfs    rw
procfs      /proc      procfs   rw
/dev/cd0a   /cdrom     cd9660   ro,noauto
tmpfs       /var/shm   tmpfs    rw,-m1777,-sram%25

How does that happen, how does one fix it ?

On 7/22/16, Ian D. Leroux wrote:
> On Fri, Jul 22, 2016, at 14:00, Robert Elz wrote:
>>     Date:        Fri, 22 Jul 2016 07:11:50 -0400
>>     From:        "Ian D. Leroux"
>>     Message-ID:  <20160722071150.5248712b562feea8d5c89...@fastmail.fm>
>>
>>   | Might this be a good moment to test them out and commit them?
>>
>> Perhaps, but not really as a fix for the current problem -- we already
>> know, from what we have been told, that not doing the tmpfs umount
>> avoids the crash ... what I, at least, would like to find is why the
>> crash happens at all, rather than just work around it.
>
> Fair enough.
>
>> That won't make umounting a tmpfs /dev any more rational to do though
>> (but just a tmpfs that happens to contain a device node is perhaps not
>> the right test for what to avoid, and manual specification when that
>> fails to DTRT isn't a great alternative.)
>
> I'm not sure there *is* a truly correct test for what to avoid, given
> the nature of what's being done at swapoff, but there may well be better
> heuristics. I don't want to derail this thread though, so we can take
> that up separately at a later date.
>
> Good luck fixing the crash!
>
> -- IDL
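Spotting this condition does not require reading the whole mount table by eye. A small sketch, assuming mount(8) output in the "FS on DIR type TYPE (options)" layout shown in this message; `dev_is_tmpfs` is a hypothetical helper name, and the sample lines are copied from the output above without the "->" marker.

```shell
#!/bin/sh
# Read mount(8)-style output on stdin; succeed if a tmpfs is mounted on /dev.
dev_is_tmpfs() {
    awk '$1 == "tmpfs" && $2 == "on" && $3 == "/dev" { found = 1 }
         END { exit !found }'
}

sample='/dev/wd0a on / type ffs (log, local)
tmpfs on /dev type tmpfs (union, local)
tmpfs on /var/shm type tmpfs (local)'

if printf '%s\n' "$sample" | dev_is_tmpfs; then
    echo "tmpfs /dev in use"
fi
```

On a live system the same check would be `mount | dev_is_tmpfs`. Matching on the third field keeps /var/shm and /dev/pts from producing false positives.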
panic after 6.x -> 7.x upgrade
Hello,
I've upgraded a server from 6.x to 7.x and it became unstable. I first upgraded the kernel (7.0_STABLE from some time ago), keeping the 6.x userland, and it ran for more than 24h without trouble. Then I upgraded the userland and the problems started.

Some filesystems are plain ffs; /usr and /var are ffs+wapbl. /tmp is mfs (not tmpfs because I have quotas here).

First, after the userland upgrade, it didn't reboot (a reboot did kill processes, but then nothing happened). I could enter ddb from here and type 'reboot' but the disks didn't get flushed. I didn't investigate from ddb, unfortunately.

After reboot and fsck I got, while going multiuser:

err
panic: kernel diagnostic assertion "(*vpp)->v_type == VNON" failed: file "/home/bouyer/src-7/src/sys/ufs/ffs/ffs_alloc.c", line 615
cpu5: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
kern_assert() at netbsd:kern_assert+0x4f
ffs_valloc() at netbsd:ffs_valloc+0x8b4
ufs_makeinode() at netbsd:ufs_makeinode+0x5e
ufs_create() at netbsd:ufs_create+0x5b
VOP_CREATE() at netbsd:VOP_CREATE+0x3d
vn_open() at netbsd:vn_open+0x3WA2R9^MNI NdoG:_ oSpPenL (N)O aTt LOneWtERbEsdD: dONo_ oSpYeSnC+AL0Lx1 111 4 0d EoX_IsTys _f4o4pe0nf5at1(0 )7 a^Mt netbsd:do_sys_openat+0x68
sys_open() at netbsd:sys_open+0x24
syscall() at netbsd:syscall+0x9a
--- syscall (number 5) ---
7f7ff643c40a:
cpu5: End traceback...

No core dump, unfortunately (it panicked a second time in wddump). I forced a fsck on the log filesystems.

The system came up multiuser and ran for about 8 hours, then:

panic: wapbl_register_deallocation: out of resources
cpu1: Begin tracebackW.A.R.^MNI NvpG:an SicPL( ) NaOTt LneOtWbEsREd:D vOpNa nSiYcS+C0Ax1L3Lc ^M0 0s npErXIinTt ff7()be 4a0t0 n0e 7tb^Ms d:snprintf
wapbl_register_inode() at netbsd:wapbl_register_inode
ffs_indirtrunc() at netbsd:ffs_indirtrunc+0x3df
ffs_truncate() at netbsd:ffs_truncate+0xc43
ufs_direnter() at netbsd:ufs_direnter+0x545
ufs_makeinode() at netbsd:ufs_makeinode+0x2c3
ufs_create() at netbsd:ufs_create+0x5b
VOP_CREATE() at netbsd:VOP_CREATE+0x3d
vn_open() at netbsd:vn_open+0x329
do_open() at netbsd:do_open+0x111
do_sys_openat() at netbsd:do_sys_openat+0x68
sys_open() at netbsd:sys_open+0x24
syscall() at netbsd:syscall+0x9a
--- syscall (number 5) ---
7f7ff583c40a:
cpu1: End traceback...

Again no core dump (this time: insufficient space 8806272 < 9472135).

The server would then panic again with the same backtrace while going multiuser (and this time I got a core dump). So I disabled log on all filesystems, and it has been stable since then.

Does it ring a bell ?

-- 
Manuel Bouyer
     NetBSD: 26 years of experience will always make the difference
--
Re: repeated failure to properly shutdown
On Fri, Jul 22, 2016, at 14:00, Robert Elz wrote:
>     Date:        Fri, 22 Jul 2016 07:11:50 -0400
>     From:        "Ian D. Leroux"
>     Message-ID:  <20160722071150.5248712b562feea8d5c89...@fastmail.fm>
>
>   | Might this be a good moment to test them out and commit them?
>
> Perhaps, but not really as a fix for the current problem -- we already
> know, from what we have been told, that not doing the tmpfs umount
> avoids the crash ... what I, at least, would like to find is why the
> crash happens at all, rather than just work around it.

Fair enough.

> That won't make umounting a tmpfs /dev any more rational to do though
> (but just a tmpfs that happens to contain a device node is perhaps not
> the right test for what to avoid, and manual specification when that
> fails to DTRT isn't a great alternative.)

I'm not sure there *is* a truly correct test for what to avoid, given the nature of what's being done at swapoff, but there may well be better heuristics. I don't want to derail this thread though, so we can take that up separately at a later date.

Good luck fixing the crash!

-- IDL
Re: repeated failure to properly shutdown
    Date:        Fri, 22 Jul 2016 07:11:50 -0400
    From:        "Ian D. Leroux"
    Message-ID:  <20160722071150.5248712b562feea8d5c89...@fastmail.fm>

  | Might this be a good moment to test them out and commit them?

Perhaps, but not really as a fix for the current problem -- we already know, from what we have been told, that not doing the tmpfs umount avoids the crash ... what I, at least, would like to find is why the crash happens at all, rather than just work around it.

If it turns out that the tmpfs being umounted is /dev (and not /var/shm which is the only tmpfs in Brad's fstab - but which is unlikely to have any union mounts anywhere near it - and not something else that is being mounted some other way) then we should be able to reproduce the exact environment, and work out why the crash happens, and then fix it.

That won't make umounting a tmpfs /dev any more rational to do though (but just a tmpfs that happens to contain a device node is perhaps not the right test for what to avoid, and manual specification when that fails to DTRT isn't a great alternative.)

kre
Re: repeated failure to properly shutdown
    Date:        Fri, 22 Jul 2016 00:33:01 -0700
    From:        bch
    Message-ID:

  | Confirm this stack frame is the/a one we care about?

It looks right, yes, though one level further up should give the same results (vp is passed in as a param, so the same one exists up one level.)

kre
Re: repeated failure to properly shutdown
On Fri, 22 Jul 2016 16:57:08 +0700 Robert Elz wrote:
> "J. Hannken-Illjes" said:
>
>   | No populated "/dev" so it uses dev on tmpfs?
>
> Ah yes, very possible - the output from mount will tell us that, but I
> remember earlier reports of problems from unmounting (or attempting
> to) an unmount of /dev (and hardly surprising really.)

Tangentially, I wrote (some of) those earlier reports of problems with unmounting /dev at shutdown and my patches to fix the problem are still sitting uncommitted in bin/51019. Might this be a good moment to test them out and commit them? If it doesn't fix OP's problem, it would at least rule out one cause.

-- IDL
Re: repeated failure to properly shutdown
I can do that tomorrow, yes.

Confirm this stack frame is the/a one we care about?

Regards,

-bch

On Jul 21, 2016 11:42 PM, "Martin Husemann" wrote:

On Thu, Jul 21, 2016 at 04:38:57PM -0700, bch wrote:
> and the v_mount refcounts and flags are:
>
> (gdb) print vp->v_mount
> $2 = (struct mount *) 0xfe81081c2008
> (gdb) print vp->v_mount->mnt_refcnt
> $3 = 2501
> (gdb) print vp->v_mount->mnt_flag
> $4 = 4128
> (gdb)

can you also show

	print *vp
	print *vp->v_mount

please?

Thanks,

Martin
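The mnt_flag value in the quoted gdb session can be decoded by hand: 4128 is 0x1020. A minimal sketch follows, assuming the MNT_UNION (0x20) and MNT_LOCAL (0x1000) values from NetBSD's sys/fstypes.h; those constants should be checked against the tree in use, and `decode_mnt_flag` is a hypothetical helper that only knows these two bits.

```shell
#!/bin/sh
# Flag values assumed from NetBSD's <sys/fstypes.h>; verify against your tree.
MNT_UNION=$((0x20))
MNT_LOCAL=$((0x1000))

# Print the names of the known bits set in a numeric mnt_flag value,
# followed by any remaining unknown bits in hex.
decode_mnt_flag() {
    flags=$1
    out=""
    if [ $(( flags & MNT_UNION )) -ne 0 ]; then out="$out union"; fi
    if [ $(( flags & MNT_LOCAL )) -ne 0 ]; then out="$out local"; fi
    rest=$(( flags & ~(MNT_UNION | MNT_LOCAL) ))
    if [ "$rest" -ne 0 ]; then
        out="$out other=$(printf '0x%x' "$rest")"
    fi
    echo "${out# }"
}

decode_mnt_flag 4128    # prints: union local
```

Under those assumed values, the result lines up with the "(union, local)" options that mount reports for the tmpfs on /dev elsewhere in this thread.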
Re: repeated failure to properly shutdown
    Date:        Fri, 22 Jul 2016 08:45:44 +
    From:        co...@sdf.org
    Message-ID:  <20160722084544.ga14...@sdf.org>

  | probably good to remember that it's also saying it's double freed.
  | is it garbage data because it was freed before?

Perhaps, we will get a better idea when we see the full struct mount that Martin requested. But I doubt that was garbage, it is too "reasonable" a value, for some definition of reasonable.

"J. Hannken-Illjes" said:

  | No populated "/dev" so it uses dev on tmpfs?

Ah yes, very possible - the output from mount will tell us that, but I remember earlier reports of problems from unmounting (or attempting to) an unmount of /dev (and hardly surprising really.)

At least it looks as if we might be getting closer to an understanding of the setup that causes the problem so it can be duplicated, and debugged quicker than the once-a-day turnaround of these e-mail messages...

kre
Re: repeated failure to properly shutdown
> On 22 Jul 2016, at 10:39, Robert Elzwrote: > >Date:Thu, 21 Jul 2016 16:38:57 -0700 >From:bch >Message-ID: >
Re: repeated failure to properly shutdown
On Fri, Jul 22, 2016 at 03:39:26PM +0700, Robert Elz wrote: > Date:Thu, 21 Jul 2016 16:38:57 -0700 > From:bch> Message-ID: >
Re: repeated failure to properly shutdown
Date:Thu, 21 Jul 2016 16:38:57 -0700 From:bchMessage-ID: