Re: Panic on a -current from 13/12/2018

2018-12-16 Thread Masanobu SAITOH

On 2018/12/17 1:09, Chavdar Ivanov wrote:

I have no idea. As I said, it is running under VirtualBox on a Windows
10 host; I put the host in hibernation whilst the NetBSD guest is
running.


I tested today's -current on VirtualBox 5.2.22 on Windows 7 64bit
(on Core i7-2600). I tried hibernating (Shutdown -> Hibernate (H)) a few
times, but I haven't been able to reproduce the problem yet.


        while (deltat > 0) {
                xtick = lapic_gettick();
                if (lapic_broken_periodic && xtick == 0 && otick == 0) {
                        lapic_initclocks();
                        xtick = lapic_gettick();
                        if (xtick == 0)
                                panic("lapic timer stopped ticking");   <=== here!
                }


If that panic is from this, lapic_broken_periodic must be true, but it's set
only when the VM is KVM:

        /*
         * Apply workaround for broken periodic timer under KVM
         */
        if (vm_guest == VM_GUEST_KVM) {
                lapic_broken_periodic = true;
                lapic_timecounter.tc_quality = -100;
                aprint_debug_dev(ci->ci_dev,
                    "applying KVM timer workaround\n");
        }
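
To illustrate the reasoning, here is a minimal Python model of the delay
loop. It is a sketch, not the kernel code: the names delay_ticks,
make_down_counter and TimerStopped, and the max_iters bound, are made up
for this example, and it assumes a down-counting timer that is reloaded
from tval when it wraps. In this model a stopped timer only reaches the
panic path when broken_periodic is true; otherwise the loop just keeps
spinning.

```python
class TimerStopped(Exception):
    """Stands in for panic("lapic timer stopped ticking")."""


def make_down_counter(tval, step):
    """Hypothetical timer: counts down by `step` per read, reloads from tval."""
    state = {"v": tval}

    def read():
        state["v"] -= step
        if state["v"] <= 0:
            state["v"] += tval
        return state["v"]

    return read


def delay_ticks(deltat, read_tick, reinit, tval, broken_periodic, max_iters=1000):
    """Model of the loop; returns True once deltat ticks have elapsed.

    max_iters only bounds the simulation. The real loop has no such bound.
    """
    otick = read_tick()
    for _ in range(max_iters):
        if deltat <= 0:
            return True
        xtick = read_tick()
        if broken_periodic and xtick == 0 and otick == 0:
            reinit()
            xtick = read_tick()
            if xtick == 0:
                raise TimerStopped("lapic timer stopped ticking")
        if xtick > otick:
            # The counter wrapped and was reloaded from tval.
            deltat -= tval - (xtick - otick)
        else:
            deltat -= otick - xtick
        otick = xtick
    return deltat <= 0


def stopped():
    """A timer that always reads 0."""
    return 0
```

With stopped as the timer, delay_ticks(35, stopped, lambda: None, 100, True)
raises TimerStopped, while the same call with broken_periodic=False returns
False after max_iters iterations, which corresponds to a hang rather than a
panic.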


 Could you try to reproduce the problem and capture the panic message?
ci4ic4-panic-01.png shows the backtrace, which has pushed the panic message
off the screen.

 Regards.


Previously it survived this using the Intel Desktop NIC
emulation within VirtualBox; even my ssh connections (from the host to
the guest) remained active. I have now switched the NIC emulation for the
NetBSD guest to virtio-net, and it behaves as before, surviving a
hibernation.

There was a VirtualBox upgrade a few weeks ago; perhaps the problem is there.
On Sun, 16 Dec 2018 at 15:55, SAITOH Masanobu  wrote:


Hi.

On 2018/12/16 18:09, Chavdar Ivanov wrote:

It happened again this morning. It happens when the host hibernates while
the guest is running. The initial trace is slightly different, but the
lines with wm_gmii are the same, so for now I will switch to a
different NIC emulation.



In your .png:

vpanic()
lapic_delay()
wm_gmii_mdic_readreg()
.
.
.


There is no panic message itself, but I suspect it's:

static void
lapic_delay(unsigned int usec)
{
        int32_t xtick, otick;
        int64_t deltat; /* XXX may want to be 64bit */

        otick = lapic_gettick();

        if (usec <= 0)
                return;
        if (usec <= 25)
                deltat = lapic_delaytab[usec];
        else
                deltat = (lapic_frac_cycle_per_usec * usec) >> 32;

        while (deltat > 0) {
                xtick = lapic_gettick();
                if (lapic_broken_periodic && xtick == 0 && otick == 0) {
                        lapic_initclocks();
                        xtick = lapic_gettick();
                        if (xtick == 0)
                                panic("lapic timer stopped ticking");   <=== here!
                }
                if (xtick > otick)
                        deltat -= lapic_tval - (xtick - otick);
                else
                        deltat -= otick - xtick;
                otick = xtick;

                x86_pause();
        }
}


What causes it?



And yes, it used to survive many hibernations of the hosts before. I
only had to adjust the time after waking the host up.
On Sat, 15 Dec 2018 at 10:59, Chavdar Ivanov  wrote:


Hi,

On 8.99.27 AMD64 running under VirtualBox I got this morning the panic
in http://ci4ic4.tx0.org/ci4ic4-panic-01.png

I have the  coredump, if it is of interest. I thought it might be
useful, as it is apparently in the wm driver.

Chavdar

--
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: UVMHIST, pmap_get_physpage panic

2018-12-16 Thread Maxime Villard

On 17/12/2018 at 08:10, Thomas Klausner wrote:

On Mon, Dec 17, 2018 at 08:06:36AM +0100, Maxime Villard wrote:

On 16/12/2018 at 09:09, Thomas Klausner wrote:

[ 16674.534547] panic: pmap_get_physpage: out of memory


Well, out of memory means out of memory. KASAN consumes a bit more than
1/8 of the KVA. So if in normal times your system would use 8GB of RAM,
KASAN adds an extra ~1.1GB.


So why doesn't it kill userland processes? I don't believe my kernel
needs all 32GB of RAM.


I don't know. In fact, I don't understand how getting this can be normal:

[ 16674.544550] pmap_growkernel() at netbsd:pmap_growkernel
[ 16674.544550] kasan_shadow_map() at netbsd:kasan_shadow_map+0xff
[ 16674.544550] pmap_growkernel() at netbsd:pmap_growkernel+0x283

pmap_growkernel() does

mutex_enter(kpm->pm_lock);

So if it's called recursively I think we have a problem. The call
path is:

pmap_growkernel -> kasan_shadow_map -> pmap_get_physpage ->
[somewhere we need to allocate KVA] -> pmap_growkernel

This problem is not KASAN-specific, because KASAN just duplicates
the existing logic:

pmap_growkernel -> pmap_alloc_level -> pmap_get_physpage

Maybe KASAN makes the problem more visible.
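
To make the concern concrete, here is a small Python sketch of re-entering
a function that takes a non-recursive lock. It is only an analogy under
stated assumptions: kpm_lock stands in for kpm->pm_lock, grow_kernel and
its depth arguments are invented for the example, and the sketch assumes
the lock must not be taken twice by the same caller.

```python
import threading

# Stand-in for kpm->pm_lock; threading.Lock is not re-entrant.
kpm_lock = threading.Lock()


def grow_kernel(depth, max_depth):
    """Hypothetical model: take the lock, then possibly re-enter ourselves."""
    if not kpm_lock.acquire(blocking=False):
        # A blocking acquire here would wait on ourselves forever.
        raise RuntimeError("lock already held at depth %d" % depth)
    try:
        if depth < max_depth:
            # Models the allocation path that ends up growing the kernel
            # map again while the lock is still held.
            grow_kernel(depth + 1, max_depth)
    finally:
        kpm_lock.release()
```

grow_kernel(0, 0) completes normally, whereas grow_kernel(0, 1) raises
RuntimeError at the nested call, because the outer call still holds the
lock.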

Do you also get out-of-memory when you disable UVMHIST?


Re: UVMHIST, pmap_get_physpage panic

2018-12-16 Thread Thomas Klausner
On Mon, Dec 17, 2018 at 08:06:36AM +0100, Maxime Villard wrote:
> On 16/12/2018 at 09:09, Thomas Klausner wrote:
> > [ 16674.534547] panic: pmap_get_physpage: out of memory
> 
> Well, out of memory means out of memory. KASAN consumes a bit more than
> 1/8 of the KVA. So if in normal times your system would use 8GB of ram,
> KASAN adds an extra ~1.1GB.

So why doesn't it kill userland processes? I don't believe my kernel
needs all 32GB of RAM.
 Thomas


Re: UVMHIST, pmap_get_physpage panic

2018-12-16 Thread Maxime Villard

On 16/12/2018 at 09:09, Thomas Klausner wrote:

[ 16674.534547] panic: pmap_get_physpage: out of memory


Well, out of memory means out of memory. KASAN consumes a bit more than
1/8 of the KVA. So if in normal times your system would use 8GB of RAM,
KASAN adds an extra ~1.1GB.
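
As a rough check of that figure, here is a short Python calculation. It
assumes the usual KASAN layout of one shadow byte per 8 bytes of tracked
memory; the function name is only for illustration. The remainder up to
~1.1GB is the "bit more" mentioned above and is not modeled here.

```python
GIB = 1 << 30


def kasan_shadow_bytes(tracked_bytes, shadow_ratio=8):
    """Shadow memory needed at one shadow byte per `shadow_ratio` bytes."""
    return tracked_bytes // shadow_ratio


print(kasan_shadow_bytes(8 * GIB) / GIB)  # prints 1.0
```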


daily CVS update output

2018-12-16 Thread NetBSD source update


Updating src tree:
P src/distrib/amd64/liveimage/emuimage/Makefile
P src/doc/CHANGES
P src/lib/libc/hash/md2/md2.3
P src/lib/librumphijack/hijack.c
P src/lib/librumphijack/rumphijack.3
P src/lib/libtelnet/auth.c
P src/sys/arch/arm/cortex/scu_reg.h
P src/sys/arch/arm/imx/imx6_pcie.c
P src/sys/arch/evbarm/nitrogen6/nitrogen6_machdep.c
P src/sys/arch/x86/x86/identcpu.c
P src/sys/arch/x86/x86/lapic.c
P src/sys/kern/files.kern
P src/sys/kern/subr_pool.c
U src/sys/kern/subr_thmap.c
P src/sys/netinet/dccp_usrreq.c
P src/sys/netinet/tcp_usrreq.c
P src/sys/netinet6/nd6.c
P src/sys/rump/librump/rumpkern/Makefile.rumpkern
P src/sys/sys/pool.h
P src/sys/sys/socketvar.h
U src/sys/sys/thmap.h
P src/tests/fs/common/fstest_zfs.c
P src/tests/fs/zfs/t_zpool.sh
P src/tests/lib/libc/net/getaddrinfo/no_serv_v4.exp
P src/usr.bin/make/parse.c
P src/usr.bin/make/unit-tests/varquote.mk
P src/usr.sbin/ndp/ndp.c
P src/usr.sbin/sysinst/Makefile.inc
P src/usr.sbin/sysinst/defs.h
P src/usr.sbin/sysinst/main.c

Updating xsrc tree:


Killing core files:



Updating release-7 src tree (netbsd-7):

Updating release-7 xsrc tree (netbsd-7):



Updating release-8 src tree (netbsd-8):
U doc/CHANGES-8.1
P sys/arch/x86/pci/amdnb_misc.c
P sys/arch/x86/pci/amdtemp.c

Updating release-8 xsrc tree (netbsd-8):




Updating file list:
-rw-rw-r--  1 srcmastr  netbsd  52414655 Dec 17 03:09 ls-lRA.gz


Re: Panic on a -current from 13/12/2018

2018-12-16 Thread Chavdar Ivanov
I have no idea. As I said, it is running under VirtualBox on a Windows
10 host; I put the host into hibernation while the NetBSD guest is
running. Previously it survived this using the Intel Desktop NIC
emulation within VirtualBox; even my ssh connections (from the host to
the guest) remained active. I have now switched the NIC emulation for the
NetBSD guest to virtio-net, and it behaves as before, surviving a
hibernation.

There was a VirtualBox upgrade a few weeks ago; perhaps the problem is there.
On Sun, 16 Dec 2018 at 15:55, SAITOH Masanobu  wrote:
>
> Hi.
>
> On 2018/12/16 18:09, Chavdar Ivanov wrote:
> > Repeated this morning. Happens when the host hibernates when the
> > machine is running. The initial trace is slightly different, but the
> > lines with wm_gmii are the same, so for now I will switch to a
> > different NIC emulator.
> >
>
> In your .png:
> >vpanic()
> >lapic_delay()
> >wm_gmii_mdic_readreg()
> >.
> >.
> >.
>
> There is no panic message itself, but I suspect it's:
> > static void
> > lapic_delay(unsigned int usec)
> > {
> > int32_t xtick, otick;
> > int64_t deltat; /* XXX may want to be 64bit */
> >
> > otick = lapic_gettick();
> >
> > if (usec <= 0)
> > return;
> > if (usec <= 25)
> > deltat = lapic_delaytab[usec];
> > else
> > deltat = (lapic_frac_cycle_per_usec * usec) >> 32;
> >
> > while (deltat > 0) {
> > xtick = lapic_gettick();
> > if (lapic_broken_periodic && xtick == 0 && otick == 0) {
> > lapic_initclocks();
> > xtick = lapic_gettick();
> > if (xtick == 0)
> > panic("lapic timer stopped ticking");   <=== here!
> > }
> > if (xtick > otick)
> > deltat -= lapic_tval - (xtick - otick);
> > else
> > deltat -= otick - xtick;
> > otick = xtick;
> >
> > x86_pause();
> > }
> > }
>
> Why does it cause?
>
>
> > And yes, it used to survive many hibernations of the hosts before. I
> > only had to adjust the time after waking the host up.
> > On Sat, 15 Dec 2018 at 10:59, Chavdar Ivanov  wrote:
> >>
> >> Hi,
> >>
> >> On 8.99.27 AMD64 running under VirtualBox I got this morning the panic
> >> in http://ci4ic4.tx0.org/ci4ic4-panic-01.png
> >>
> >> I have the  coredump, if it is of interest. I thought it might be
> >> useful, as it is apparently in the wm driver.
> >>
> >> Chavdar
> >> --
> >> 
> >
> >
> >
>
>
> --
> ---
> SAITOH Masanobu (msai...@execsw.org
>  msai...@netbsd.org)






Re: Panic on a -current from 13/12/2018

2018-12-16 Thread SAITOH Masanobu
Hi.

On 2018/12/16 18:09, Chavdar Ivanov wrote:
> Repeated this morning. Happens when the host hibernates when the
> machine is running. The initial trace is slightly different, but the
> lines with wm_gmii are the same, so for now I will switch to a
> different NIC emulator.
> 

In your .png:
>vpanic()
>lapic_delay()
>wm_gmii_mdic_readreg()
>.
>.
>.

There is no panic message itself, but I suspect it's:
> static void
> lapic_delay(unsigned int usec)
> {
>         int32_t xtick, otick;
>         int64_t deltat; /* XXX may want to be 64bit */
>
>         otick = lapic_gettick();
>
>         if (usec <= 0)
>                 return;
>         if (usec <= 25)
>                 deltat = lapic_delaytab[usec];
>         else
>                 deltat = (lapic_frac_cycle_per_usec * usec) >> 32;
>
>         while (deltat > 0) {
>                 xtick = lapic_gettick();
>                 if (lapic_broken_periodic && xtick == 0 && otick == 0) {
>                         lapic_initclocks();
>                         xtick = lapic_gettick();
>                         if (xtick == 0)
>                                 panic("lapic timer stopped ticking");   <=== here!
>                 }
>                 if (xtick > otick)
>                         deltat -= lapic_tval - (xtick - otick);
>                 else
>                         deltat -= otick - xtick;
>                 otick = xtick;
>
>                 x86_pause();
>         }
> }

What causes it?


> And yes, it used to survive many hibernations of the hosts before. I
> only had to adjust the time after waking the host up.
> On Sat, 15 Dec 2018 at 10:59, Chavdar Ivanov  wrote:
>>
>> Hi,
>>
>> On 8.99.27 AMD64 running under VirtualBox I got this morning the panic
>> in http://ci4ic4.tx0.org/ci4ic4-panic-01.png
>>
>> I have the  coredump, if it is of interest. I thought it might be
>> useful, as it is apparently in the wm driver.
>>
>> Chavdar
>> --
>> 
> 
> 
> 


-- 
---
SAITOH Masanobu (msai...@execsw.org
 msai...@netbsd.org)


Re: build.sh syspkgs

2018-12-16 Thread Olaf Seibert
On Sun 16 Dec 2018 at 14:46:54 +0100, Rhialto wrote:
> regpkg: ERROR: The metalog file 
> (/vol1/rhialto/destdir.amd64/METALOG.sanitised) does not
> contain entries for the following files or directories
> which should be part of the base-util-root syspkg:
> ./bin/\133
> --- makesyspkgs ---
> *** [makesyspkgs] Error code 128
> nbmake[1]: stopped in /mnt/vol1/rhialto/cvs/src/distrib/sets
> 1 error

From the cvs history, I see that the last struggle with this was in 2014.

Here is a potential patch:

Index: join.awk
===
RCS file: /cvsroot/src/distrib/sets/join.awk,v
retrieving revision 1.6
diff -u -r1.6 join.awk
--- join.awk    24 Oct 2014 22:19:44 -  1.6
+++ join.awk    16 Dec 2018 15:08:42 -
@@ -30,6 +30,8 @@
 # join.awk F1 F2
 #  Similar to join(1), this reads a list of words from F1
 #  and outputs lines in F2 with a first word that is in F1.
+#  For purposes of matching the first word, both instances are
+#  canonicalised via unvis(word); the version from F2 is printed.
 #  Neither file needs to be sorted
 
 function unvis(s) \
@@ -79,17 +81,16 @@
                exit 1
        }
        while ( (getline < ARGV[1]) > 0) {
-               $1 = unvis($1)
-               words[$1] = $0
+               f1 = unvis($1)
+               words[f1] = $0
        }
        delete ARGV[1]
 }
 
-// { $1 = unvis($1) }
+{ f1 = unvis($1) }
 
-$1 in words \
+f1 in words \
 {
-       f1=$1
        $1=""
        print words[f1] $0
 }

This join.awk script is used to take the file names that are in a
PLIST-type file and select just those same lines from the METALOG file.

I think that the issue is that the join.awk script unvis()es the file
names in all cases.

The PLIST has /bin/[ (which is vis()ed to /bin/\133 when regpkg writes
spec1, at regpkg:810), and the METALOG has /bin/\133 too. The resulting
metalog-type output file, spec2, then contains the unvis()ed /bin/[
again. This happens at line 818 of cvs/src/distrib/sets/regpkg.

After that, the check at regpkg:836 compares whether spec1 and spec2
contain the same names, which they do not, since one of them is
unvis()ed. Hence the error message, which for our purposes is likely
spurious.

The first chunk of the diff fixes the undesired unvis()ing. As it was,
assigning to $1 (the first field of the line) also changed $0, the line
as a whole, so the unvis()ed line from the METALOG was stored (and
possibly printed later). Storing the unvis()ed version in a temporary,
used only as the key, preserves the original line.

Likewise, in the remaining changes, I reinstate the unvis() call, but
again use a temporary for cleanliness.

So this presumes that the METALOG file is properly vis()ed, but I think
that is a fairly safe assumption.
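
For readers who do not speak awk, the idea can be sketched in Python. This
is a simplified model, not a port of join.awk: unvis_octal handles only
\ooo octal escapes, the function names are invented for the example, and
matching F2 lines are emitted unchanged. The point it shows is that the
decoded name is used only as the lookup key, so the original spelling of
each line survives.

```python
import re


def unvis_octal(word):
    """Decode \\ooo octal escapes only (a small subset of unvis(3))."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), word)


def join_on_first_word(f1_lines, f2_lines):
    """Return the F2 lines whose first word, once decoded, appears in F1."""
    wanted = set()
    for line in f1_lines:
        fields = line.split()
        if fields:
            # The decoded form is only a key; the line itself is untouched.
            wanted.add(unvis_octal(fields[0]))
    selected = []
    for line in f2_lines:
        fields = line.split()
        if fields and unvis_octal(fields[0]) in wanted:
            selected.append(line)
    return selected
```

In this model, a list containing either "./bin/[" or "./bin/\133" selects
the METALOG line starting with "./bin/\133", and that line comes out with
its escape intact.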

Comments? Ok to commit?

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- "What good is a Ring of Power
\X/ rhialto/at/falu.nl  -- if you're unable...to Speak." - Agent Elrond




build.sh syspkgs

2018-12-16 Thread Rhialto
For the fun of it, I tried a "build.sh syspkgs", because I saw it as a
subcommand of build.sh and I hadn't heard about it for a while.

Is this actually supposed to work, or was this in the process of being
removed but not completely?

Anyway, it started out well but then stopped with this error:

regpkg: WARNING: no comment for "base-x11-root" (using placeholder)
regpkg: WARNING: no description for "base-x11-root" (re-using comment)
Registered base-x11-root-8.99.27.0.20181215
  Packaged base-x11-root-8.99.27.0.20181215.tgz
Registered base-util-root-8.99.27.0.20181212
regpkg: ERROR: The metalog file (/vol1/rhialto/destdir.amd64/METALOG.sanitised) does not
contain entries for the following files or directories
which should be part of the base-util-root syspkg:
./bin/\133
--- makesyspkgs ---
*** [makesyspkgs] Error code 128
nbmake[1]: stopped in /mnt/vol1/rhialto/cvs/src/distrib/sets
1 error

Actually, the named METALOG.sanitised does contain a line with exactly
that spelling:

./bin/\133 type=file uname=root gname=wheel mode=0555 size=18416
sha256=887c6f1483584be2d8a8247cccef74592807859f88a5ba1b193f43fe47d81132

However the file cvs/src/distrib/sets/lists/base/mi references the file
like this:

./bin/[ base-util-root

So it seems that somewhere along the line, the [ gets escaped for the
error message but not for the actual check against the METALOG file.

Or maybe the entry in METALOG should not be escaped? Does anybody know?
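
For reference, \133 is the three-digit octal escape for [ (ASCII 91). The
short Python sketch below reproduces that spelling. The function name and
the set of escaped characters are assumptions made for the example; which
characters vis(3) really escapes depends on the flags it is called with.

```python
def vis_octal(name, escape_chars="[]*?#\\ "):
    """Replace each character in escape_chars with a backslash and 3 octal digits."""
    return "".join("\\%03o" % ord(c) if c in escape_chars else c for c in name)


print(vis_octal("./bin/["))  # prints ./bin/\133
```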

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- "What good is a Ring of Power
\X/ rhialto/at/falu.nl  -- if you're unable...to Speak." - Agent Elrond




Re: Panic on a -current from 13/12/2018

2018-12-16 Thread Chavdar Ivanov
It happened again this morning. It happens when the host hibernates while
the guest is running. The initial trace is slightly different, but the
lines with wm_gmii are the same, so for now I will switch to a
different NIC emulation.

And yes, it used to survive many hibernations of the host before; I
only had to adjust the time after waking the host up.
On Sat, 15 Dec 2018 at 10:59, Chavdar Ivanov  wrote:
>
> Hi,
>
> On 8.99.27 AMD64 running under VirtualBox I got this morning the panic
> in http://ci4ic4.tx0.org/ci4ic4-panic-01.png
>
> I have the  coredump, if it is of interest. I thought it might be
> useful, as it is apparently in the wm driver.
>
> Chavdar
> --
> 






UVMHIST, pmap_get_physpage panic

2018-12-16 Thread Thomas Klausner
Hi!

I've added UVMHIST to my kernel config (now it's GENERIC + KASAN
+ UVMHIST). I noticed that UVMHIST slowed the machine down a bit (not
by a factor of two, but in the ballpark, for bulk builds). I have had
two panics since.

The machine is doing a bulk build (in a tmpfs) and some file I/O (via
NFS mostly).

The first panic was the usual SPL NOT LOWERED gibberish (attached).

The second was:

[ 16674.534547] panic: pmap_get_physpage: out of memory
[ 16674.534547] cpu10: Begin traceback...
[ 16674.534547] vpanic() at netbsd:vpanic+0x221
[ 16674.534547] snprintf() at netbsd:snprintf
[ 16674.544550] pmap_growkernel() at netbsd:pmap_growkernel
[ 16674.544550] kasan_shadow_map() at netbsd:kasan_shadow_map+0xff
[ 16674.544550] pmap_growkernel() at netbsd:pmap_growkernel+0x283
[ 16674.554553] uvm_map_prepare() at netbsd:uvm_map_prepare+0xe14
[ 16674.554553] uvm_map() at netbsd:uvm_map+0xec
[ 16674.564557] uvm_km_alloc() at netbsd:uvm_km_alloc+0x466
[ 16674.564557] pool_grow() at netbsd:pool_grow+0xbb
[ 16674.574561] pool_catchup() at netbsd:pool_catchup+0x46
[ 16674.574561] pool_get() at netbsd:pool_get+0x7e1
[ 16674.584564] allocbuf() at netbsd:allocbuf+0x119
[ 16674.584564] getblk() at netbsd:getblk+0x185
[ 16674.584564] bio_doread() at netbsd:bio_doread+0x1b
[ 16674.594568] bread() at netbsd:bread+0x18
[ 16674.594568] ffs_init_vnode() at netbsd:ffs_init_vnode+0x1cd
[ 16674.604572] ffs_loadvnode() at netbsd:ffs_loadvnode+0xc8
[ 16674.604572] vcache_get() at netbsd:vcache_get+0x4f4
[ 16674.604572] ufs_lookup() at netbsd:ufs_lookup+0x1320
[ 16674.614575] VOP_LOOKUP() at netbsd:VOP_LOOKUP+0xb6
[ 16674.614575] lookup_once() at netbsd:lookup_once+0x34b
[ 16674.624579] namei_tryemulroot() at netbsd:namei_tryemulroot+0x87d
[ 16674.624579] namei() at netbsd:namei+0x65
[ 16674.634583] fd_nameiat.isra.2() at netbsd:fd_nameiat.isra.2+0xd1
[ 16674.634583] do_sys_statat() at netbsd:do_sys_statat+0x111
[ 16674.644586] sys___lstat50() at netbsd:sys___lstat50+0x85
[ 16674.644586] syscall() at netbsd:syscall+0x308
[ 16674.644586] --- syscall (number 441) ---
[ 16674.644586] 761a961145aa:
[ 16674.644586] cpu10: End traceback...

I have a kernel core dump for this one.

Is this a bug or do I need to get more RAM?

Comments on UVMHIST performance cost and the first panic are also
appreciated.

Thanks,
 Thomas


panic.gz
Description: application/gunzip