Re: repeated failure to properly shutdown

2016-07-22 Thread Michael Plass
On Jul 22, 2016, at 5:02 PM, Robert Elz wrote:

> the question is just where is that first attempt.

Hmm, it looks like doing "shutdown now" to get into single-user will
force-unmount the tmpfs file systems (/etc/rc.d/swap1), so you could be left
in a state where creating a regular /dev/null becomes all too easy.
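
A quick way to tell whether /dev/null has been clobbered in this way is to
look at its node type. The following is only a minimal sketch; the helper name
node_kind is made up for illustration and is not part of any NetBSD script.

```shell
#!/bin/sh
# node_kind PATH: print what kind of filesystem object PATH is.
# (hypothetical helper, for illustration only)
node_kind() {
    if [ -c "$1" ]; then
        echo "character device"
    elif [ -f "$1" ]; then
        echo "regular file"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# On a healthy system /dev/null should be reported as a character device.
node_kind /dev/null
```

If this reports "regular file", the node has been replaced by an ordinary
file and needs to be removed and recreated.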


As an aside, while testing things out I also noticed that MAKEDEV tries to
use $MKNOD in create_mfs_dev(), but this is only set if the -m switch is
supplied. So the attempt to make a temporary console fails.
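
One possible way to guard against an unset $MKNOD is a shell default
expansion. This is a sketch of the general technique only, not the actual
MAKEDEV source; the function name pick_mknod and the fallback to plain
"mknod" are assumptions.

```shell
#!/bin/sh
# pick_mknod: print the mknod command to use.
# Falls back to plain "mknod" when MKNOD is unset or empty.
# (hypothetical helper; the fallback value is an assumption)
pick_mknod() {
    printf '%s\n' "${MKNOD:-mknod}"
}

pick_mknod
```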

Thanks,
- Michael



daily CVS update output

2016-07-22 Thread NetBSD source update

Updating src tree:
P src/etc/rc.d/mountcritlocal
P src/sys/arch/amd64/amd64/machdep.c
P src/sys/arch/amd64/include/pmap.h
P src/sys/arch/mips/mips/bus_dma.c
P src/sys/arch/x86/x86/pmap.c
P src/sys/dev/ata/wd.c
P src/sys/dev/ata/wdvar.h
P src/sys/net/if.c
P src/sys/net/if.h

Updating xsrc tree:


Killing core files:


Updating tar files:
src/top-level: collecting... replacing... done
src/bin: collecting... replacing... done
src/common: collecting... replacing... done
src/compat: collecting... replacing... done
src/crypto: collecting... replacing... done
src/dist: collecting... replacing... done
src/distrib: collecting... replacing... done
src/doc: collecting... replacing... done
src/etc: collecting... replacing... done
src/external: collecting... replacing... done
src/extsrc: collecting... replacing... done
src/games: collecting... replacing... done
src/gnu: collecting... replacing... done
src/include: collecting... replacing... done
src/lib: collecting... replacing... done
src/libexec: collecting... replacing... done
src/regress: collecting... replacing... done
src/rescue: collecting... replacing... done
src/sbin: collecting... replacing... done
src/share: collecting... replacing... done
src/sys: collecting... replacing... done
src/tests: collecting... replacing... done
src/tools: collecting... replacing... done
src/usr.bin: collecting... replacing... done
src/usr.sbin: collecting... replacing... done
src/config: collecting... replacing... done
src: collecting... replacing... done
xsrc/top-level: collecting... replacing... done
xsrc/external: collecting... replacing... done
xsrc/local: collecting... replacing... done
xsrc: collecting... replacing... done

Running the SUP scanner:
SUP Scan for current starting at Sat Jul 23 03:08:23 2016
SUP Scan for current completed at Sat Jul 23 03:08:38 2016
SUP Scan for mirror starting at Sat Jul 23 03:08:38 2016
SUP Scan for mirror completed at Sat Jul 23 03:10:47 2016



Updating release-6 src tree (netbsd-6):
U doc/CHANGES-6.2
P libexec/mail.local/mail.local.c

Updating release-6 xsrc tree (netbsd-6):


Updating release-6 tar files:
src/top-level: collecting... replacing... done
src/bin: collecting... replacing... done
src/common: collecting... replacing... done
src/compat: collecting... replacing... done
src/crypto: collecting... replacing... done
src/dist: collecting... replacing... done
src/distrib: collecting... replacing... done
src/doc: collecting... replacing... done
src/etc: collecting... replacing... done
src/external: collecting... replacing... done
src/extsrc: collecting... replacing... done
src/games: collecting... replacing... done
src/gnu: collecting... replacing... done
src/include: collecting... replacing... done
src/lib: collecting... replacing... done
src/libexec: collecting... replacing... done
src/regress: collecting... replacing... done
src/rescue: collecting... replacing... done
src/sbin: collecting... replacing... done
src/share: collecting... replacing... done
src/sys: collecting... replacing... done
src/tests: collecting... replacing... done
src/tools: collecting... replacing... done
src/usr.bin: collecting... replacing... done
src/usr.sbin: collecting... replacing... done
src/config: collecting... replacing... done
src/x11: collecting... replacing... done
xsrc/top-level: collecting... replacing... done
xsrc/external: collecting... replacing... done
xsrc/local: collecting... replacing... done
xsrc/xfree: collecting... replacing... done

Running the SUP scanner:
SUP Scan for release-6 starting at Sat Jul 23 03:15:49 2016
SUP Scan for release-6 completed at Sat Jul 23 03:15:58 2016



Updating release-7 src tree (netbsd-7):
U doc/CHANGES-7.1
P libexec/mail.local/mail.local.c

Updating release-7 xsrc tree (netbsd-7):


Updating release-7 tar files:
src/top-level: collecting... replacing... done
src/bin: collecting... replacing... done
src/common: collecting... replacing... done
src/compat: collecting... replacing... done
src/crypto: collecting... replacing... done
src/dist: collecting... replacing... done
src/distrib: collecting... replacing... done
src/doc: collecting... replacing... done
src/etc: collecting... replacing... done
src/external: collecting... replacing... done
src/extsrc: collecting... replacing... done
src/games: collecting... replacing... done
src/gnu: collecting... replacing... done
src/include: collecting... replacing... done
src/lib: collecting... replacing... done
src/libexec: collecting... replacing... done
src/regress: collecting... replacing... done
src/rescue: collecting... replacing... done
src/sbin: collecting... replacing... done
src/share: collecting... replacing... done
src/sys: collecting... replacing... done
src/tests: collecting... replacing... done
src/tools: collecting... replacing... done
src/usr.bin: collecting... replacing... done
src/usr.sbin: collecting... replacing... done
src/config: collecting... replacing... done
src/x11: collecting... replacing... done
src: collecting... replacing... done

Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2016 17:09:30 -0700
From:bch 
Message-ID:  


  | Iirc, where I *noticed* it was /etc/defaults/rc.d

Yes, that (/etc/defaults/rc.conf I assume you mean) writes to /dev/null - but
init has made the tmpfs /dev (if it needs it) before it runs /etc/rc - and so
before /etc/defaults/rc.conf gets used.

There is essentially nothing possible from when the system boots before the
tmpfs /dev/ is made - when /dev/console is not there.  MAKEDEV is just about
the first thing init does in that case - which is why I initially assumed that
the problem had to be there (but MAKEDEV does literally nothing to /dev/null
except mknod it when needed.)

Given that I see two possibilities, and maybe you can remember which is more
likely?   Either /dev/null (the file in /dev on the root filesys) got created
before you initially booted the system, or it was created while you were
fixing the missing /dev/console just recently.

First, at any time did you have your new root filesystem mounted somewhere,
and chroot /to/it (that is if it were on /mnt and you did "chroot /mnt") ?

There was no need to do that to fix the missing /dev/console (and the missing
rest of /dev) and it was not what Martin suggested you do, so I am going to
guess that this did not happen today/yesterday when you were fixing things.
Sound right?

So, think back to when you first built the system.   Sometime then you would
have needed to do some configuration - did you boot first, and then configure
(stuff like the hostname, the network config, rc_configured=YES in rc.conf etc)
or did you set some of that up before you booted?  (It doesn't matter here
if it was the very first boot or not, just if you did setup only while running
the new system, or if you did some config from the system you used to run
build.sh).

If you did config using the older system - how did that happen?   Did you just
"cd /new-root/etc; edit; edit; edit ..." or did you "chroot /new-root" ... ?

kre



Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2016 16:27:19 -0700
From:bch 
Message-ID:  


  | It could be that for some reason it's missing, and a first attempt to write
  | to it just creates a regular file...

Yes, that's a given - the question is just where is that first attempt.
It has to be before init makes the tmpfs /dev (the /dev/null created in
the tmpfs would have been the right thing, or you would have noticed that
problem much sooner) and it has to be with the live system as root (not
when it is mounted on /mnt) as nothing is likely to accidentally write
to /mnt/dev/null ...
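
The mechanism itself is easy to demonstrate safely in a scratch directory:
redirecting output to a path where no device node exists simply creates a
regular file. A minimal sketch, with a made-up function name, that never
touches the real /dev:

```shell
#!/bin/sh
# redirect_into DIR: redirect some output to DIR/null, then report
# what kind of object now exists there.
# (hypothetical helper, for illustration only)
redirect_into() {
    target="$1/null"
    echo "meant to be discarded" > "$target"
    if [ -c "$target" ]; then
        echo "character device"
    elif [ -f "$target" ]; then
        echo "regular file"
    fi
}

scratch=$(mktemp -d)
redirect_into "$scratch"
rm -rf "$scratch"
```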

A chroot to /mnt might do it I suppose...

Michael Plass  said:

  | Could it perhaps come from the ( set -o tabcomplete 2>/dev/null ) in /etc/
  | shrc? 

Not in normal operation, /etc/shrc wouldn't normally be able to run until
way after init has created the tmpfs /dev - init doesn't set ENV, so the
sh it runs to execute MAKEDEV wouldn't source shrc - ENV set to /etc/shrc
normally comes from /root/.profile which would be used only when root logs in.

kre




Re: repeated failure to properly shutdown

2016-07-22 Thread Michael Plass
On Jul 22, 2016, at 4:24 PM, Robert Elz wrote:

>Date:Sat, 23 Jul 2016 04:38:42 +0700
>From:Robert Elz 
>Message-ID:  <20406.1469223...@andromeda.noi.kre.to>
> 
>  | That /dev/null turned into a regular file is another bug [...]
>  | (This turns out to be a bug in MAKEDEV [...]
> 
> Actually, not, it must be elsewhere, or as a result of something else.
> 
> kre
> 
> 
> 


Could it perhaps come from the ( set -o tabcomplete 2>/dev/null ) in /etc/shrc?

- Michael

Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Sat, 23 Jul 2016 04:38:42 +0700
From:Robert Elz 
Message-ID:  <20406.1469223...@andromeda.noi.kre.to>

  | That /dev/null turned into a regular file is another bug [...]
  | (This turns out to be a bug in MAKEDEV [...]

Actually, not, it must be elsewhere, or as a result of something else.

kre



Re: repeated failure to properly shutdown

2016-07-22 Thread Rhialto
On Sat 23 Jul 2016 at 04:38:42 +0700, Robert Elz wrote:
> That /dev/null turned into a regular file is another bug - it is being
> created before the tmpfs /dev is made, I have seen that before as well,
> but just corrected and ignored the problem until now.   

Similarly, I noticed that if /var is a tmpfs (or any initially empty
directory really), then /etc/rc.d/mountcritlocal fails because it wants
to cd to /var/run and that has not been created (if that ever happens).
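
A defensive pattern for that kind of failure is to create the directory
before changing into it. The sketch below shows the idea only; it is not
the mountcritlocal source, and the function name enter_dir is hypothetical.

```shell
#!/bin/sh
# enter_dir DIR: make sure DIR exists, then cd into it.
# Returns non-zero if the directory cannot be created or entered.
# (hypothetical helper, for illustration only)
enter_dir() {
    mkdir -p "$1" && cd "$1"
}
```

Called as "enter_dir /var/run", this would succeed even when /var starts
out empty, provided /var itself is writable at that point.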

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- Wayland: Those who don't understand X
\X/ rhialto/at/xs4all.nl-- are condemned to reinvent it. Poorly.




Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2016 12:52:58 -0700
From:bch 
Message-ID:  


  | I think that biggest concern (unclean shutdown/reboot) is solved (collision
  | of /dev and a tmpfs mount, caused by default behavior of init in face of
  | missing /dev/console).

Yes, and now we know what the cause is, we should be able to duplicate the
problem, and work out what is really happening.   The system is supposed to
work with a tmpfs /dev, it should not panic during shutdown.   What's more,
this panic is probably not related to it being /dev - any tmpfs mounted
with -o union on a mount point that is using WAPBL (-o log) will probably
panic the same way.

  | This disk was prepared remotely (I.e. from another running NetBSD box) by
  | partitioning the disk (disklabel), formatting (newfs), then mounting all
  | partitions appropriately under /mnt and running ./build.sh ... install=/mnt

That builds enough for the system to install, but it does not make a
fully runnable system - there's more that sysinst normally does, like
populating /dev, but also making a basic rc.conf (including all network
config, setting hostname) and fstab, which don't get built by build.sh
either (and nor should they).   Those you must have done later.  Running
MAKEDEV is just another of the steps that one needs to perform.

That /dev/null turned into a regular file is another bug - it is being
created before the tmpfs /dev is made, I have seen that before as well,
but just corrected and ignored the problem until now.   (This turns out
to be a bug in MAKEDEV which is run by init to make the tmpfs /dev when
/dev/console is not present.)

Your solution to that was correct.   (MAKEDEV did not fix it as it never
replaces anything that already exists, only makes what does not - even if
what exists is nonsense, and even if it created the nonsense itself.)

It is good that your problems are overcome now - and thanks for bringing it to
our attention, and for being willing to suffer through getting enough info
to allow the problem to be better understood.

kre



Re: repeated failure to properly shutdown

2016-07-22 Thread Martin Husemann
On Fri, Jul 22, 2016 at 11:37:56AM -0700, bch wrote:
> How does that happen, how does one fix it ?

It is created by init if there is no /dev/console.

Boot some install media, mount your root file system (say on /mnt)
then: 

cd /mnt/dev
sh MAKEDEV all

(hoping there is a MAKEDEV script there, if not: extract it from etc.tgz
from the install sets)

Then reboot and check mount again.
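
The "check mount again" step can be scripted. The sketch below assumes the
mount output format shown elsewhere in this thread ("tmpfs on /dev type
tmpfs (...)"); the function name has_tmpfs_dev is made up.

```shell
#!/bin/sh
# has_tmpfs_dev: read mount(8)-style output on stdin and succeed
# if a tmpfs is mounted directly on /dev (not on /dev/pts etc.).
# (hypothetical helper; assumes lines of the form
#  "tmpfs on /dev type tmpfs (...)")
has_tmpfs_dev() {
    grep -q '^tmpfs on /dev type tmpfs'
}

if mount | has_tmpfs_dev; then
    echo "init is still providing a tmpfs /dev"
else
    echo "no tmpfs on /dev"
fi
```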

Martin


Re: repeated failure to properly shutdown

2016-07-22 Thread bch
Wow -- there -is- a tmpfs on /dev

kamloops# mount
/dev/wd0a on / type ffs (log, local)
->  tmpfs on /dev type tmpfs (union, local)
/dev/wd0e on /var type ffs (log, local)
/dev/wd0f on /usr type ffs (log, local)
/dev/wd0g on /home type ffs (log, local)
kernfs on /kern type kernfs (local)
ptyfs on /dev/pts type ptyfs (local)
procfs on /proc type procfs (local)
tmpfs on /var/shm type tmpfs (local)


But no entry for it in fstab...

# NetBSD /etc/fstab
# See /usr/share/examples/fstab/ for more examples.
/dev/wd0a   /         ffs     rw,log     1 1
/dev/wd0b   none      swap    sw,dp      0 0
/dev/wd0f   /usr      ffs     rw,log     1 2
/dev/wd0e   /var      ffs     rw,log     1 2
/dev/wd0g   /home     ffs     rw,log     1 2
kernfs      /kern     kernfs  rw
ptyfs       /dev/pts  ptyfs   rw
procfs      /proc     procfs  rw
/dev/cd0a   /cdrom    cd9660  ro,noauto

tmpfs       /var/shm  tmpfs   rw,-m1777,-sram%25



How does that happen, how does one fix it ?




On 7/22/16, Ian D. Leroux  wrote:
>
>
> On Fri, Jul 22, 2016, at 14:00, Robert Elz wrote:
>> Date:Fri, 22 Jul 2016 07:11:50 -0400 From:"Ian D.
>> Leroux"  Message-ID:
>> <20160722071150.5248712b562feea8d5c89...@fastmail.fm>
>>
>>   | Might this be a good moment to test them out and commit them?
>>
>> Perhaps, but not really as a fix for the current problem -- we already
>> know, from what we have been told, that not doing the tmpfs umount
>> avoids the crash ... what I, at least, would like to find is why the
>> crash happens at all, rather than just work around it.
>
> Fair enough.
>
>> That won't make umounting a tmpfs /dev any more rational to do though
>> (but just a tmpfs that happens to contain a device node is perhaps not
>> the right test for what to avoid, and manual specification when that
>> fails to DTRT isn't a great alternative.)
>
> I'm not sure there *is* a truly correct test for what to avoid, given
> the nature of what's being done at swapoff, but there may well be better
> heuristics.  I don't want to derail this thread though, so we can take
> that up separately at a later date.
>
> Good luck fixing the crash!
>
> -- IDL
>


panic after 6.x -> 7.x upgrade

2016-07-22 Thread Manuel Bouyer
Hello,
I've upgraded a server from 6.x to 7.x and it became unstable.
I first did upgrade the kernel (7.0_STABLE from some time ago),
keeping the 6.x userland, and it did run for more than 24h without troubles.
Then I did upgrade the userland and problems started.
Some filesystems are plain ffs, /usr and /var are ffs+wapbl.
/tmp is mfs (not tmpfs because I have quotas here).

First, after userland upgrade, it didn't reboot (a reboot did kill
processes, but then nothing happened). I could enter ddb from here
and type 'reboot' but the disks didn't get flushed. I didn't investigate
from ddb, unfortunately.

After reboot and fsck I got, while going multiuser:
err panic: kernel diagnostic assertion "(*vpp)->v_type == VNON" failed: file "/home/bouyer/src-7/src/sys/ufs/ffs/ffs_alloc.c", line 615
cpu5: Begin traceback...
vpanic() at netbsd:vpanic+0x13c
kern_assert() at netbsd:kern_assert+0x4f
ffs_valloc() at netbsd:ffs_valloc+0x8b4
ufs_makeinode() at netbsd:ufs_makeinode+0x5e
ufs_create() at netbsd:ufs_create+0x5b
VOP_CREATE() at netbsd:VOP_CREATE+0x3d
vn_open() at netbsd:vn_open+0x329
WARNING: SPL NOT LOWERED ON SYSCALL 1 40 EXIT f440f510 7
do_open() at netbsd:do_open+0x111
do_sys_openat() at netbsd:do_sys_openat+0x68
sys_open() at netbsd:sys_open+0x24
syscall() at netbsd:syscall+0x9a
--- syscall (number 5) ---
7f7ff643c40a:
cpu5: End traceback...

no core dump unfortunately (panicked a second time in wddump).

I did force a fsck on log filesystems. The system came up multiuser and
ran for about 8 hours, then:
panic: wapbl_register_deallocation: out of resources
cpu1: Begin traceback...
WARNING: SPL NOT LOWERED ON SYSCALL 0 0 EXIT f7be4000 7
vpanic() at netbsd:vpanic+0x13c
snprintf() at netbsd:snprintf
wapbl_register_inode() at netbsd:wapbl_register_inode
ffs_indirtrunc() at netbsd:ffs_indirtrunc+0x3df
ffs_truncate() at netbsd:ffs_truncate+0xc43
ufs_direnter() at netbsd:ufs_direnter+0x545
ufs_makeinode() at netbsd:ufs_makeinode+0x2c3
ufs_create() at netbsd:ufs_create+0x5b
VOP_CREATE() at netbsd:VOP_CREATE+0x3d
vn_open() at netbsd:vn_open+0x329
do_open() at netbsd:do_open+0x111
do_sys_openat() at netbsd:do_sys_openat+0x68
sys_open() at netbsd:sys_open+0x24
syscall() at netbsd:syscall+0x9a
--- syscall (number 5) ---
7f7ff583c40a:
cpu1: End traceback...

again no core dump (this time: insufficient space 8806272 < 9472135)

the server would then panic again with the same backtrace while going
multiuser (and this time I got a core dump).
So I disabled log on all filesystems, and it has been stable since
then.

Does it ring a bell ? 

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: repeated failure to properly shutdown

2016-07-22 Thread Ian D. Leroux


On Fri, Jul 22, 2016, at 14:00, Robert Elz wrote:
> Date:Fri, 22 Jul 2016 07:11:50 -0400 From:"Ian D.
> Leroux"  Message-ID:
> <20160722071150.5248712b562feea8d5c89...@fastmail.fm>
>
>   | Might this be a good moment to test them out and commit them?
>
> Perhaps, but not really as a fix for the current problem -- we already
> know, from what we have been told, that not doing the tmpfs umount
> avoids the crash ... what I, at least, would like to find is why the
> crash happens at all, rather than just work around it.

Fair enough.

> That won't make umounting a tmpfs /dev any more rational to do though
> (but just a tmpfs that happens to contain a device node is perhaps not
> the right test for what to avoid, and manual specification when that
> fails to DTRT isn't a great alternative.)

I'm not sure there *is* a truly correct test for what to avoid, given
the nature of what's being done at swapoff, but there may well be better
heuristics.  I don't want to derail this thread though, so we can take
that up separately at a later date.

Good luck fixing the crash!

-- IDL


Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2016 07:11:50 -0400
From:"Ian D. Leroux" 
Message-ID:  <20160722071150.5248712b562feea8d5c89...@fastmail.fm>

  | Might this be a good moment to test them out and commit them?

Perhaps, but not really as a fix for the current problem -- we already
know, from what we have been told, that not doing the tmpfs umount
avoids the crash ... what I, at least, would like to find is why the
crash happens at all, rather than just work around it.

If it turns out that the tmpfs being umounted is /dev (and not /var/shm
which is the only tmpfs in Brad's fstab - but which is unlikely to have
any union mounts anywhere near it - and not something else that is being
mounted some other way) then we should be able to reproduce the exact
environment, and work out why the crash happens, and then fix it.

That won't make umounting a tmpfs /dev any more rational to do though
(but just a tmpfs that happens to contain a device node is perhaps not
the right test for what to avoid, and manual specification when that fails
to DTRT isn't a great alternative.)

kre



Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2016 00:33:01 -0700
From:bch 
Message-ID:  


  | Confirm this stack frame is the/a one we care about?

It looks right, yes, though one level further up should give the
same results (vp is passed in as a param, so the same one exists up one level.)

kre



Re: repeated failure to properly shutdown

2016-07-22 Thread Ian D. Leroux
On Fri, 22 Jul 2016 16:57:08 +0700 Robert Elz  wrote:
> "J. Hannken-Illjes"  said:
> 
>   | No populated "/dev" so it uses dev on tmpfs? 
> 
> Ah yes, very possible - the output from mount will tell us that, but I
> remember earlier reports of problems from unmounting (or attempting
> to) an unmount of /dev (and hardly surprising really.)

Tangentially, I wrote (some of) those earlier reports of problems with
unmounting /dev at shutdown and my patches to fix the problem are
still sitting uncommitted in bin/51019.  Might this be a good moment to
test them out and commit them?  If it doesn't fix OP's problem, it
would at least rule out one cause.

-- IDL


Re: repeated failure to properly shutdown

2016-07-22 Thread bch
I can do that tomorrow, yes. Confirm this stack frame is the/a one we care
about?

Regards,

-bch
On Jul 21, 2016 11:42 PM, "Martin Husemann"  wrote:

On Thu, Jul 21, 2016 at 04:38:57PM -0700, bch wrote:
> and the v_mount refcounts and flags are:
>
> (gdb) print vp->v_mount
> $2 = (struct mount *) 0xfe81081c2008
> (gdb) print vp->v_mount->mnt_refcnt
> $3 = 2501
> (gdb) print vp->v_mount->mnt_flag
> $4 = 4128
> (gdb)

can you also show

 print *vp
 print *vp->v_mount

please?

Thanks,

Martin
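
For reference, the mnt_flag value 4128 quoted above can be split into its
bits with shell arithmetic. This is a sketch only: the two constants are
the values MNT_UNION (0x20) and MNT_LOCAL (0x1000) are believed to have in
NetBSD's headers, and should be treated as assumptions to check against
the source tree; the function name decode_mnt_flag is made up.

```shell
#!/bin/sh
# Assumed flag values (verify against the kernel headers):
MNT_UNION=32      # 0x00000020
MNT_LOCAL=4096    # 0x00001000

# decode_mnt_flag N: print the names of the known bits set in N,
# followed by "other:<bits>" if anything else remains.
# (hypothetical helper, knows only the two flags above)
decode_mnt_flag() {
    flag=$1
    out=""
    if [ $(( flag & MNT_UNION )) -ne 0 ]; then
        out="$out union"
    fi
    if [ $(( flag & MNT_LOCAL )) -ne 0 ]; then
        out="$out local"
    fi
    rest=$(( flag & ~(MNT_UNION | MNT_LOCAL) ))
    if [ "$rest" -ne 0 ]; then
        out="$out other:$rest"
    fi
    echo "${out# }"
}

decode_mnt_flag 4128    # 4128 = 4096 + 32, prints "union local"
```

Under those assumed values, 4128 corresponds to a local filesystem mounted
with the union option.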


Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Fri, 22 Jul 2016 08:45:44 +
From:co...@sdf.org
Message-ID:  <20160722084544.ga14...@sdf.org>

  | probably good to remember that it's also saying it's double freed.
  | is it garbage data because it was freed before?

Perhaps, we will get a better idea when we see the full struct mount that
Martin requested.   But I doubt that was garbage, it is too "reasonable"
a value, for some definition of reasonable.


"J. Hannken-Illjes"  said:

  | No populated "/dev" so it uses dev on tmpfs? 

Ah yes, very possible - the output from mount will tell us that, but I
remember earlier reports of problems from unmounting (or attempting to)
an unmount of /dev (and hardly surprising really.)   At least it looks as
if we might be getting closer to an understanding of the setup that causes
the problem so it can be duplicated, and debugged quicker than once a
day turnaround of these e-mail messages...

kre




Re: repeated failure to properly shutdown

2016-07-22 Thread J. Hannken-Illjes

> On 22 Jul 2016, at 10:39, Robert Elz  wrote:
> 
>Date:Thu, 21 Jul 2016 16:38:57 -0700
>From:bch 
>Message-ID:  
> 

Re: repeated failure to properly shutdown

2016-07-22 Thread coypu
On Fri, Jul 22, 2016 at 03:39:26PM +0700, Robert Elz wrote:
> Date:Thu, 21 Jul 2016 16:38:57 -0700
> From:bch 
> Message-ID:  
> 

Re: repeated failure to properly shutdown

2016-07-22 Thread Robert Elz
Date:Thu, 21 Jul 2016 16:38:57 -0700
From:bch 
Message-ID: