from:"Jaromír Doleček"

Re: Samba DC provisioning fails with Posix ACL enabled FFS

2021-11-29 Thread Jaromír Doleček

UFS_ACL is now enabled in XEN3_DOMU.

On Mon, 29 Nov 2021 at 17:46, Matthias Petermann wrote:
>
> On 28.11.21 at 17:32, Christos Zoulas wrote:
> > Thanks for the bug report :-)
> >
> > christos
> >
>
> You're welcome :-)
>
> One more small question: currently the UFS_ACL option in the XEN3_DOMU
> is not enabled by default for the amd64 architecture. For XEN_DOM0 the
> option is enabled. I guess that the main use case for the ACLs for many
> users will be Samba. If one installs Samba on a Xen system, it will
> probably be in a DOMU rather than a DOM0.
>
> What do you think about enabling this UFS_ACL for XEN3_DOMU as well?
>
> Kind regards
> Matthias

Re: build.sh live-image [virtio disk hang]

2021-06-01 Thread Jaromír Doleček

On Tue, 1 Jun 2021 at 18:35, Rhialto wrote:
> I re-tried the same thing (almost the same thing; the partition was
> smaller) with an amd64/9.2 install in an OpenStack VM: extracting the
> pkgsrc tar file using sysinst. It hung before finishing.
>
> So there is either some general disk I/O problem, or it is specific to
> virtio disks (which seems more likely, so far).
>
> I could get the libvirt xml description and/or the qemu command line, in
> case it would prove useful. I did post the dmesg from the -current
> kernel elsewhere in the thread.

Is there any way to get a kernel backtrace? It would be strange if it's
virtio unless there is some missed interrupt.

The 4GB of RAM should be fine; it's 32MB machines that might have problems :D

Jaromir

Re: posix_spawn issue?

2021-05-01 Thread Jaromír Doleček

On Sat, 1 May 2021 at 14:25, Martin Husemann wrote:
>
> On Sat, May 01, 2021 at 01:13:43PM +0200, Thomas Klausner wrote:
> > The whole file is here:
> >
> > https://git.savannah.gnu.org/cgit/make.git/tree/src/job.c
>
> But since it mostly works (and only fails in some environments) there
> must be something special ongoing, and we will have to find out what that
> is. For that we need the concrete invocation it is trying to execute
> and see why it fails.
>
> Maybe it is the shell fallback and default_shell is wrong?
>
> Nothing really obvious in that file, everything quite normal posix_spawn()
> operations (and we have ATF tests for those, except POSIX_SPAWN_RESETIDS).

Well, there is one obvious difference - gmake uses
POSIX_SPAWN_USEVFORK - glibc supports it, we don't.
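
For illustration, a minimal sketch of what requesting that flag looks like. This is not gmake's code: spawn_shell() is a made-up helper, and the #ifdef guard is only an assumption about how portable code could cope with a flag that exists on glibc but not elsewhere.

```c
#define _GNU_SOURCE /* glibc only exposes POSIX_SPAWN_USEVFORK with this */
#include <assert.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: run "/bin/sh -c cmd" via posix_spawn() and return
 * the child's exit status, or -1 if spawning or waiting fails.
 */
int
spawn_shell(const char *cmd)
{
    char *argv[] = { "sh", "-c", (char *)cmd, NULL };
    posix_spawnattr_t attr;
    short flags = 0;
    pid_t pid;
    int status;
    int error;

    assert(cmd != NULL);

    if (posix_spawnattr_init(&attr) != 0)
        return -1;
#ifdef POSIX_SPAWN_USEVFORK
    flags |= POSIX_SPAWN_USEVFORK; /* non-standard extension, only if defined */
#endif
    error = posix_spawnattr_setflags(&attr, flags);
    if (error == 0)
        error = posix_spawn(&pid, "/bin/sh", NULL, &attr, argv, environ);
    posix_spawnattr_destroy(&attr);
    if (error != 0)
        return -1;

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With the guard, the same source builds whether or not the headers define the flag; without it, the build fails on a system that lacks the flag.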

Jaromir

Re: posix_spawn issue?

2021-05-01 Thread Jaromír Doleček

On Sat, 1 May 2021 at 11:15, Martin Husemann wrote:
>
> On Sat, May 01, 2021 at 11:02:26AM +0200, Thomas Klausner wrote:
> > gmake since version 4.3 uses posix_spawn(), but that breaks the build
> > of firefox (and libreoffice). Disabling posix_spawn() support in gmake
> > works around this problem.[1]
> >
> > Is there a bug/incompatibility in our posix_spawn() or is there a bug
> > in gmake?
>
> Hard to tell from the data available.
>
> We need a smaller test case reproducing the issue - debugging it in the
> firefox build is not very practical.

A good first step might be getting more than 'error 127' out of gmake
for it.

I'd expect any incompatibility to involve unsupported or incorrectly
working attributes or file actions. Does gmake do anything odd with
them?
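
As a rough sketch of why 'error 127' carries so little information: POSIX allows posix_spawn() either to return an errno value in the parent or, if the failure happens after the child has been created, to have the child exit with status 127. The helper below (spawn_and_explain() is a made-up name, not anything from gmake) reports both cases separately.

```c
#include <assert.h>
#include <spawn.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper: spawn argv[0] (used as a path, no PATH search) and
 * write a human-readable outcome into buf.
 * Returns 0 if the child exited with status 0, non-zero otherwise.
 */
int
spawn_and_explain(char *const argv[], char *buf, size_t buflen)
{
    pid_t pid;
    int status;
    int error;

    assert(argv != NULL && argv[0] != NULL && buf != NULL && buflen > 0);

    error = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);
    if (error != 0) {
        /* Failure reported in the parent: error is an errno value. */
        snprintf(buf, buflen, "posix_spawn(%s): %s", argv[0], strerror(error));
        return error;
    }
    if (waitpid(pid, &status, 0) == -1) {
        snprintf(buf, buflen, "waitpid(%s) failed", argv[0]);
        return -1;
    }
    if (!WIFEXITED(status)) {
        snprintf(buf, buflen, "%s: terminated abnormally", argv[0]);
        return -1;
    }
    if (WEXITSTATUS(status) == 127) {
        /* A failure inside the child may be visible only as this status. */
        snprintf(buf, buflen, "%s: exit status 127 (exec may have failed)",
            argv[0]);
        return 127;
    }
    snprintf(buf, buflen, "%s: exit status %d", argv[0], WEXITSTATUS(status));
    return WEXITSTATUS(status);
}
```

Which of the two failure paths is taken depends on the implementation, so a caller that only looks at the exit status cannot tell a missing binary from a program that itself exited with 127.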

Jaromir

Re: make fails to build on linux

2021-04-17 Thread Jaromír Doleček

On Sat, 17 Apr 2021 at 19:49, Manuel Bouyer wrote:
>
> On Sat, Apr 17, 2021 at 07:25:58PM +0200, Manuel Bouyer wrote:
> > Hello
> > trying a build.sh tools on linux I got:
> > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:
> > In function '__regex_wctype':
> > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2:
> >  error: 'for' loop initial declarations are only allowed in C99 mode
> >   for (size_t i = 0; i < __arraycount(wctypes); i++) {
> > ^
> > /dsk/l1/misc/bouyer/HEAD/clean/src/tools/compat/../../lib/libc/regex/regcomp.c:254:2: note: use option -std=c99 or -std=gnu99 to compile your code
> >
> > What is the right fix for this ?
> >
> > For now I just moved the declaration outside of the loop
>
> Well, the build fails later with the same error.
> Using "-V HOST_CFLAGS=-std=gnu99" allows the tools to build; maybe
> this should be the default ?

I think it would be sensible to use -std=c99 by default, yes. It's
strange that the Linux toolchain refuses it by default; do we perhaps
force some other -std flag by default now?
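
For reference, a small sketch of the two forms involved. The names are made up and this is not the regcomp.c code; ARRAY_COUNT stands in for __arraycount() so the sketch builds anywhere. The first function needs C99 or later, the second also compiles in C89/gnu89 mode, which is what moving the declaration out of the loop achieves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for __arraycount() so the sketch builds anywhere. */
#define ARRAY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

static const char *const class_names[] = { "alnum", "alpha", "blank" };

/* C99 and later: the loop variable is declared in the for statement. */
int
find_class_c99(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < ARRAY_COUNT(class_names); i++) {
        if (strcmp(class_names[i], name) == 0)
            return (int)i;
    }
    return -1;
}

/* C89-compatible: the declaration is moved out of the loop. */
int
find_class_c89(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < ARRAY_COUNT(class_names); i++) {
        if (strcmp(class_names[i], name) == 0)
            return (int)i;
    }
    return -1;
}
```

Both functions behave identically; only the first one is rejected by a compiler running in a pre-C99 mode.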

Jaromir

Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Jaromír Doleček

On Wed, 14 Apr 2021 at 03:21, Greg A. Woods wrote:
> However their front-end code does detect it and seems to make use of it,
> and has done for some 6 years now according to "git blame" (with no
> recent fixes beyond fixing a memory leak on their end).  Here we see it
> live from FreeBSD's sysctl output, thus my concern that this feature may
> be the source of the problem:

You can test if this is the problem by disabling the feature in
negotiation in NetBSD xbdback.c - comment out the code which sets
feature-max-indirect-segments in xbdback_backend_changed(). With the
feature disabled, FreeBSD DomU should not use indirect segments.

Jaromir

Re: Status of COMPAT_LINUX and Linux emulation?

2020-09-02 Thread Jaromír Doleček

COMPAT_LINUX works as well as always, and will continue working the
same. Presence in GENERIC does not change how reliable it is now or in
future. There are no plans to remove the actual code, the option as
well as the kernel module will continue working.

Going forward, using the kernel module is probably the better option:
that way you can use the standard distributed sets without needing a
custom kernel.

There were numerous bugs found in various parts of the compat code
that prompted this change. The removal from GENERIC intends to reduce
the attack surface of the default distributed kernel. Thus any future
discovered problem would only affect the people who explicitly enable
COMPAT_LINUX on their systems.

Jaromir

On Wed, 2 Sep 2020 at 15:44, Thomas Mueller wrote:
>
> I noticed that COMPAT_LINUX was removed from GENERIC kernel configuration 
> file but still could be used.
>
> So far, I left it in my custom kernel config, figuring it would do no harm 
> when not used, and might possibly be useful under certain circumstances.
>
> I would guess that running Linux binaries under NetBSD (amd64 or i386) would
> be very unreliable since it was removed from GENERIC.
>
> Has experience running Linux programs in NetBSD been generally negative or 
> high-risk?
>
> FreeBSD has Linuxulator which seems to be doing at least fairly well and in 
> no danger of being removed any time soon.
>
> Tom
>

Re: Missing "something" in nvme config messages

2020-07-28 Thread Jaromír Doleček

I changed the message, now it should say 'ld at nvme0 nsid 1 not configured'

On Tue, 28 Jul 2020 at 15:08, Paul Goyette wrote:
>
> If you have nvme configured, but do NOT have ``ld* at nvme?'' you
> get an error message with something missing:
>
> ...
> [ 7.616966] nvme0: for io queue 6 interrupting at msix6 vec 6 affinity to cpu5
> [ 7.616966] nvme0: for io queue 7 interrupting at msix6 vec 7 affinity to cpu6
> [ 7.616966] at nvme0 nsid 1 not configured
> ...
>
> Note that the "not configured" message doesn't say what is not
> configured!
>
> Sidebar:  It might seem "pointless" to configure the nvme device but not
> configure any ld children.  But it actually makes sense when both device
> drivers are loaded as modules.  When the ld module is eventually loaded,
> it correctly displays the attach message:
>
> [ 7.636966] ld0 at nvme0 nsid 1
>
>
> :)
>
>
> ++--+---+
> | Paul Goyette   | PGP Key fingerprint: | E-mail addresses: |
> | (Retired)  | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com |
> | Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org   |
> ++--+---+

Re: xentools413 build failure

2020-07-24 Thread Jaromír Doleček

This is now fixed; the header should be installed again by a distribution build.

Thanks for the report.

Jaromir

On Wed, 22 Jul 2020 at 22:04, Jaromír Doleček wrote:
>
> Let me recheck this, I removed the header on current since there
> didn't seem to be any use for it in xen 4.11. Apparently I overlooked
> something.

Re: xentools413 build failure

2020-07-22 Thread Jaromír Doleček

Let me recheck this, I removed the header on current since there
didn't seem to be any use for it in xen 4.11. Apparently I overlooked
something.

On Wed, 22 Jul 2020 at 21:26, Chavdar Ivanov wrote:
>
> Hi,
>
> Under -current amd64 xentools 4.13.1 fail to build as follows:
> 
> gcc -I/usr/pkg/include -I/usr/include -I/usr/pkg/include/python3.7
> -I/usr/pkg/include/glib-2.0 -I/usr/pkg/include/gio-unix-2.0
> -I/usr/pkg/lib/glib-2.0/include -I/usr/X11R7/include
> -D_XOPEN_SOURCE_EXTENDED=1 -I/usr/pkg/include/ncurses -DPIC -O2
> -I/usr/pkg/include -I/usr/include -I/usr/pkg/include/python3.7
> -I/usr/pkg/include/glib-2.0 -I/usr/pkg/include/gio-unix-2.0
> -I/usr/pkg/lib/glib-2.0/include -I/usr/X11R7/include
> -D_XOPEN_SOURCE_EXTENDED=1 -I/usr/pkg/include/ncurses -m64 -DBUILD_ID
> -fno-strict-aliasing -std=gnu99 -Wall -Wstrict-prototypes
> -Wdeclaration-after-statement -Wno-unused-but-set-variable
> -Wno-unused-local-typedefs   -m64 -DBUILD_ID -fno-strict-aliasing
> -std=gnu99 -Wall -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdirs-all.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
> -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdir-all-libs.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99
> -Wall -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdirs-all.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
> -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .subdir-all-gnttab.d   -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99
> -Wall -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .build.d   -Werror -Wmissing-prototypes -I./include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/libs/toollog/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/libs/toolcore/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/include
> -m64 -DBUILD_ID -fno-strict-aliasing -std=gnu99 -Wall
> -Wstrict-prototypes  -Wdeclaration-after-statement
> -Wno-unused-but-set-variable -Wno-unused-local-typedefs   -O2
> -fomit-frame-pointer
> -D__XEN_INTERFACE_VERSION__=__XEN_LATEST_INTERFACE_VERSION__ -MMD -MF
> .netbsd.opic.d   -Werror -Wmissing-prototypes -I./include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/libs/toollog/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/include
>  
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/libs/toolcore/include
> -I/usr/pkgsrc/sysutils/xentools413/work/xen-4.13.1/tools/libs/gnttab/../../../tools/include
>  -fPIC -c -o netbsd.opic netbsd.c
> netbsd.c:32:10: fatal error: xen/xenio.h: No such file or directory
>  #include <xen/xenio.h>
>   ^
> compilation terminated.
>
> 
>
> Any idea? I have the same package built on the 5th of July:
>
>  35288534 Jul  5 11:14 /usr/pkgsrc/packages/All/xentools413-4.13.1.tgz
>
> Chavdar
>
>
> --
>

Re: Build error on amd64 -current

2020-06-27 Thread Jaromír Doleček

Fixed in rev.1.137 of sys/arch/xen/x86/cpu.c
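
For readers unfamiliar with this class of warning, a generic sketch follows. It is not the change made in cpu.c; the names and the 4096-byte size are made up. With a limit such as -Wstack-usage=3584 (3.5 KiB), the first function would be flagged because its large automatic array counts against the stack frame, while the second keeps the same buffer off the stack. Kernel code would use the kernel allocator rather than malloc(); malloc() is used here only so the sketch runs in userland.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SCRATCH_SIZE 4096 /* hypothetical buffer size, above a 3.5 KiB limit */

/* A large automatic array: the whole buffer lives in this stack frame. */
unsigned
sum_with_stack_buffer(unsigned char fill)
{
    unsigned char scratch[SCRATCH_SIZE];
    unsigned sum = 0;
    size_t i;

    memset(scratch, fill, sizeof(scratch));
    for (i = 0; i < sizeof(scratch); i++)
        sum += scratch[i];
    return sum;
}

/*
 * Same work with the buffer on the heap.
 * Returns 0 on success, -1 if the allocation fails.
 */
int
sum_with_heap_buffer(unsigned char fill, unsigned *sum)
{
    unsigned char *scratch;
    size_t i;

    assert(sum != NULL);

    scratch = malloc(SCRATCH_SIZE);
    if (scratch == NULL)
        return -1;
    memset(scratch, fill, SCRATCH_SIZE);
    *sum = 0;
    for (i = 0; i < SCRATCH_SIZE; i++)
        *sum += scratch[i];
    free(scratch);
    return 0;
}
```

The trade-off is an allocation that can fail and must be freed, in exchange for a small, predictable stack frame.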

On Sat, 27 Jun 2020 at 11:42, Jaromír Doleček wrote:
>
> I'll fix it.
>
> On Sat, 27 Jun 2020 at 11:39, Andreas Gustafsson wrote:
> >
> > Paul Goyette wrote:
> > > With up-to-date sources I'm getting
> > >
> > > /build/netbsd-compat/src_ro/sys/arch/xen/x86/cpu.c: In function 'mp_cpu_start':
> > > /build/netbsd-compat/src_ro/sys/arch/xen/x86/cpu.c:999:1: error: stack usage is 5408 bytes [-Werror=stack-usage=]
> > >   mp_cpu_start(struct cpu_info *ci, vaddr_t target)
> > >   ^~~~
> >
> > It started with this commit:
> >
> >   2020.06.25.14.52.26 jdolecek src/sys/conf/Makefile.kern.inc 1.274
> >
> >   enable gcc stack usage limit for kernel functions, set to 3.5 KiB for now
> >   as that seems to be enough to accommodate the current biggest stack usages
> >
> >   there are about six functions which use over 3KiB local stack, and
> >   about a dozen between 2-3 KiB, so pushing this further needs more work
> >   if desired
> >
> >   compile tested on amd64, i386, sparc64, sparc, powerpc (evbppc - BookE),
> >   m68k (mac68k)
> >
> > --
> > Andreas Gustafsson, g...@gson.org

Re: Build error on amd64 -current

2020-06-27 Thread Jaromír Doleček

I'll fix it.

On Sat, 27 Jun 2020 at 11:39, Andreas Gustafsson wrote:
>
> Paul Goyette wrote:
> > With up-to-date sources I'm getting
> >
> > /build/netbsd-compat/src_ro/sys/arch/xen/x86/cpu.c: In function 'mp_cpu_start':
> > /build/netbsd-compat/src_ro/sys/arch/xen/x86/cpu.c:999:1: error: stack usage is 5408 bytes [-Werror=stack-usage=]
> >   mp_cpu_start(struct cpu_info *ci, vaddr_t target)
> >   ^~~~
>
> It started with this commit:
>
>   2020.06.25.14.52.26 jdolecek src/sys/conf/Makefile.kern.inc 1.274
>
>   enable gcc stack usage limit for kernel functions, set to 3.5 KiB for now
>   as that seems to be enough to accommodate the current biggest stack usages
>
>   there are about six functions which use over 3KiB local stack, and
>   about a dozen between 2-3 KiB, so pushing this further needs more work
>   if desired
>
>   compile tested on amd64, i386, sparc64, sparc, powerpc (evbppc - BookE),
>   m68k (mac68k)
>
> --
> Andreas Gustafsson, g...@gson.org

Re: ZFS disaster on -current

2020-06-24 Thread Jaromír Doleček

By chance, do you have the kernel crash dump from the original panic
which happened yesterday? The subsequent ones might be a result of the
first one.

The messages about redzone mean only that there is no overflow
protection for items in those pools.

Jaromir

On Wed, 24 Jun 2020 at 11:34, Chavdar Ivanov wrote:
>
> Hi,
>
> On
>
> NetBSD ymir 9.99.68 NetBSD 9.99.68 (GENERIC) #1: Tue Jun 23 22:53:46
> BST 2020  
> sysbuild@ymir:/home/sysbuild/amd64/obj/home/sysbuild/src/sys/arch/amd64/compile/GENERIC
> amd64
>
> I suddenly got a panic with ZFS; it took place with the previous
> kernel, so it was something with the module. In single user I disabled
> zfs in /etc/rc.conf and was able to complete boot, but obviously
> without my two pools.
>
> 'modload solaris' didn't show any problem.
>
> I set aside the contents of /etc/zfs and did 'modload zfs', which resulted in:
>
> .
>
> WARNING: ZFS on NetBSD is under development
> pool redzone disabled for 'zio_buf_4096'
> pool redzone disabled for 'zio_data_buf_4096'
> pool redzone disabled for 'zio_buf_8192'
> pool redzone disabled for 'zio_data_buf_8192'
> pool redzone disabled for 'zio_buf_16384'
> pool redzone disabled for 'zio_data_buf_16384'
> pool redzone disabled for 'zio_buf_32768'
> pool redzone disabled for 'zio_data_buf_32768'
> pool redzone disabled for 'zio_buf_65536'
> pool redzone disabled for 'zio_data_buf_65536'
> pool redzone disabled for 'zio_buf_131072'
> pool redzone disabled for 'zio_data_buf_131072'
> pool redzone disabled for 'zio_buf_262144'
> pool redzone disabled for 'zio_data_buf_262144'
> pool redzone disabled for 'zio_buf_524288'
> pool redzone disabled for 'zio_data_buf_524288'
> pool redzone disabled for 'zio_buf_1048576'
> pool redzone disabled for 'zio_data_buf_1048576'
> pool redzone disabled for 'zio_buf_2097152'
> pool redzone disabled for 'zio_data_buf_2097152'
> pool redzone disabled for 'zio_buf_4194304'
> pool redzone disabled for 'zio_data_buf_4194304'
> pool redzone disabled for 'zio_buf_8388608'
> pool redzone disabled for 'zio_data_buf_8388608'
> pool redzone disabled for 'zio_buf_16777216'
> pool redzone disabled for 'zio_data_buf_16777216'
>
> I have no idea what that means, it is a first for me, ZFS otherwise
> has been very reliable on this hardware so far, inasmuch as I have the
> mercurial repo on a zfs and build from it from time to time (the panic
> is from the last cvs update from yesterday, though).
>
> Subsequent 'zpool import' repeated the panic (without getting me into
> the debugger, though):
>
>
> ZFS filesystem version: 5
> uvm_fault(0xa97e4c3e1610, 0x0, 1) -> e
> fatal page fault in supervisor mode
> trap type 6 code 0 rip 0x81d49882 cs 0x8 rflags 0x10286 cr2
> 0xa0 ilevel 0 rsp 0xde819c16d760
> curlwp 0xa97e3a41e140 pid 17394.17394 lowest kstack 0xde819c16a2c0
> panic: trap
> cpu0: Begin traceback...
> vpanic() at netbsd:vpanic+0x152
> snprintf() at netbsd:snprintf
> startlwp() at netbsd:startlwp
> alltraps() at netbsd:alltraps+0xc3
> vdev_open() at zfs:vdev_open+0x9e
> vdev_open_children() at zfs:vdev_open_children+0x39
> vdev_root_open() at zfs:vdev_root_open+0x33
> vdev_open() at zfs:vdev_open+0x9e
> spa_load() at zfs:spa_load+0x38e
> spa_tryimport() at zfs:spa_tryimport+0x86
> zfs_ioc_pool_tryimport() at zfs:zfs_ioc_pool_tryimport+0x41
> zfsdev_ioctl() at zfs:zfsdev_ioctl+0x8c1
> nb_zfsdev_ioctl() at zfs:nb_zfsdev_ioctl+0x38
> VOP_IOCTL() at netbsd:VOP_IOCTL+0x44
> vn_ioctl() at netbsd:vn_ioctl+0xa5
> sys_ioctl() at netbsd:sys_ioctl+0x550
> syscall() at netbsd:syscall+0x26e
> --- syscall (number 54) ---
> netbsd:syscall+0x26e:
> cpu0: End traceback...
>
> The above panic did not leave a crash dump.
>
> When I had /etc/zfs populated before, I also got a crash dump (with
> 'reboot 0x104'), as follows:
>
> # crash -M netbsd.18.core -N netbsd.18
> Crash version 9.99.68, image version 9.99.68.
> crash: _kvm_kvatop(0)
> Kernel compiled without options LOCKDEBUG.
> System panicked: reboot forced via kernel debugger
> Backtrace from time of crash is available.
> crash> bt
> _KERNEL_OPT_NARCNET() at 0
> _KERNEL_OPT_NARCNET() at 0
> sys_reboot() at sys_reboot
> db_fncall() at db_fncall
> db_command() at db_command+0x127
> db_command_loop() at db_command_loop+0xa6
> db_trap() at db_trap+0xe6
> kdb_trap() at kdb_trap+0xe1
> trap() at trap+0x2b7
> --- trap (number 6) ---
> vdev_disk_open.part.4() at vdev_disk_open.part.4+0x49a
> vdev_open() at vdev_open+0x9e
> vdev_open_children() at vdev_open_children+0x39
> vdev_root_open() at vdev_root_open+0x33
> vdev_open() at vdev_open+0x9e
> spa_load() at spa_load+0x38e
> spa_load_best() at spa_load_best+0x58
> spa_open_common() at spa_open_common+0xc2
> pool_status_check.part.25() at pool_status_check.part.25+0x1e
> zfsdev_ioctl() at zfsdev_ioctl+0x80e
> nb_zfsdev_ioctl() at nb_zfsdev_ioctl+0x38
> VOP_IOCTL() at VOP_IOCTL+0x44
> vn_ioctl() at vn_ioctl+0xa5
> sys_ioctl() at sys_ioctl+0x550
> syscall() at syscall+0x26e
> ---

Re: qemu emulated machine crashes due to disk timeouts

2020-05-15 Thread Jaromír Doleček

On Fri, 15 May 2020 at 15:53, Jonathan A. Kollasch wrote:
>
> On Sat, May 02, 2020 at 12:02:45PM +1000, Paul Ripke wrote:
> > Since I have my qemu disk images on slow spinning rust host disks, when the
> > host disk is busy (esp. daily+security runs), I find my qemu vm's see disk
> > timeouts, and end up crashing. This isn't great behaviour.
>
> Timeout issue aside, crashing because of it is a bug in the error
> handling code somewhere.

Yes, this is the root cause of the panic:

panic: LOCKDEBUG: Mutex error: mutex_vector_enter,509: assertion
failed: !cpu_intr_p()
[ 13493.1166960] cpu0: Begin traceback...
[ 13493.1166960] vpanic() at netbsd:vpanic+0x178
[ 13493.1166960] snprintf() at netbsd:snprintf
[ 13493.1166960] lockdebug_more() at netbsd:lockdebug_more
[ 13493.1166960] mutex_enter() at netbsd:mutex_enter+0x656
[ 13493.1166960] suspendsched() at netbsd:suspendsched+0x19
[ 13493.1166960] cpu_reboot() at netbsd:cpu_reboot+0x46
[ 13493.1166960] sys_reboot() at netbsd:sys_reboot
[ 13493.1166960] vpanic() at netbsd:vpanic+0x181
[ 13493.1166960] snprintf() at netbsd:snprintf
[ 13493.1166960] startlwp() at netbsd:startlwp
[ 13493.1166960] alltraps() at netbsd:alltraps+0xc3
[ 13493.1166960] wdc_ata_bio_start() at netbsd:wdc_ata_bio_start+0xcbd
[ 13493.1166960] ata_xfer_start() at netbsd:ata_xfer_start+0x4f
[ 13493.1166960] wdc_ata_bio_intr() at netbsd:wdc_ata_bio_intr+0x3b9
[ 13493.1166960] wdcintr() at netbsd:wdcintr+0x10a
[ 13493.1166960] intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x36
[ 13493.1166960] Xhandle_ioapic_edge3() at netbsd:Xhandle_ioapic_edge3+0x6d

I'll recheck the path; something still seems to fool the code into
running the handling in interrupt context when it thinks it's in process
context.

Jaromir

Re: qemu emulated machine crashes due to disk timeouts

2020-05-14 Thread Jaromír Doleček

On Thu, 14 May 2020 at 15:19, [sender elided] wrote:
> > Bumped ATA_DELAY to 3 (was 1), and the VM stayed up overnight,
> > only logging the one correctable soft error:
> >
> > May  7 04:19:29 qemu /netbsd: [ 16290.3345912] autoconfiguration error: piixide0:0:0: lost interrupt
> > May  7 04:19:29 qemu /netbsd: [ 16290.3345912]  type: ata tc_bcount: 512 tc_skip: 0
> > May  7 04:19:29 qemu /netbsd: [ 16290.3345912] autoconfiguration error: piixide0:0:0: bus-master DMA error: missing interrupt, status=0x21
> > May  7 04:19:29 qemu /netbsd: [ 16290.6088515] wd0a: DMA error writing fsbn 1801813 (wd0 bn 1801876; cn 879 tn 52 sn 20), xfer 38, retry 0
> > May  7 04:19:29 qemu /netbsd: [ 16292.6053372] wd0: soft error (corrected) xfer 38
> >
> > Would making ATA_DELAY configurable via options(4) be worth it?
>
> QEMU emulation isn't a niche setup, we should aim to have it work out of
> the box without adjusting sysctls, IMO.

I agree. Let's fix QEMU IDE emulation?

Seriously though, I think it wouldn't hurt to just bump ATA_DELAY
to 30 seconds by default.

As another general suggestion: you'll usually get better performance
by using the PV drivers, i.e. virtio, or ahcisata emulation.

Jaromir

Re: panic in ciss.c

2020-04-17 Thread Jaromír Doleček

On Fri, 17 Apr 2020 at 22:38, [sender elided] wrote:
> This was the first time. This machine survived successful complete source
> builds and pkgsrc builds with -current up to 9.99.49. 50, 52 and 55 panicked
> somewhere else. 9.99.56 survived much longer than these, and then this one
> showed up.

Looking at the sources for ciss, this seems like a forgotten debugging
KASSERT() - it looks like that normally happens if the polled command
times out.

Having said that, the code handling the condition after the KASSERT()
seems to be wrong anyway: ciss_done() fails to process the result as an
error since (ccb_cmd.id & CISS_CMD_ERR) is not set, and even if that
worked, ciss_error() expects ccb_err.cmd_stat to be set to a
little-endian value.

Jaromir

Re: umass

2020-04-14 Thread Jaromír Doleček

On Tue, 14 Apr 2020 at 15:11, Patrick Welche wrote:
>
> I just plugged in my USB SD card reader to a box with a new kernel and:
>
> ugen0: Generic (0x058f) Mass Storage Device (0x6366), rev 2.00/1.00, addr 1
>
> No SD...
>
> On NetBSD 9.99.52, it used to look like:
>
> Apr  2 20:12:19 quantz /netbsd: [ 110784.6752445] umass0: Generic (0x058f) Mass Storage Device (0x6366), rev 2.00/1.00, addr 1
> Apr  2 20:12:19 quantz /netbsd: [ 110784.6752445] umass0: using SCSI over Bulk-Only

Are you sure you have umass* in your kernel config, and that you re-ran config?

Mine with -current:

[ 2.630264] umass0 at uhub0 port 6 configuration 1 interface 0
[ 2.630264] umass0: SanDisk (0x0781) Ultra Fit (0x5583), rev 3.00/1.00, addr 1
[ 2.630264] umass0: using SCSI over Bulk-Only
[ 2.630264] scsibus0 at umass0: 2 targets, 1 lun per target
...

SCSI over Bulk-Only should be unaffected by the removal of ISD-ATA support.

Jaromir

Re: Automated report: NetBSD-current/i386 build failure

2020-04-10 Thread Jaromír Doleček

Fix committed. The compiler was correct: I reused code around
frame_list from setup and didn't notice that the type is different.
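
A small userland sketch of the type mismatch discussed in the quoted report. The typedefs are stand-ins, not the kernel's definitions: they assume a 32-bit u_long and a 64-bit paddr_t, as on i386 with PAE. A physical address above 4 GB loses its high bits in the narrow type, while a page frame number for the same address still fits.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* 4 KiB pages */

typedef uint32_t sketch_u_long; /* stand-in for u_long on i386 */
typedef uint64_t sketch_paddr;  /* stand-in for paddr_t on i386 PAE */

/* Round-trip an address through the narrow type: high bits are lost. */
uint64_t
via_u_long(uint64_t pa)
{
    sketch_u_long narrow = (sketch_u_long)pa;

    return narrow;
}

/* Round-trip through the wide type: the value is preserved. */
uint64_t
via_paddr(uint64_t pa)
{
    sketch_paddr wide = pa;

    return wide;
}

/* Page frame number of a page-aligned physical address. */
uint64_t
frame_of(uint64_t pa)
{
    assert((pa & ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1)) == 0);
    return pa >> SKETCH_PAGE_SHIFT;
}
```

For example, 0x123456000 becomes 0x23456000 after passing through the 32-bit type, whereas its frame number 0x123456 fits comfortably, which matches the report's caveat that storing page numbers rather than addresses would not necessarily overflow.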

Thank you.

Jaromir

On Fri, 10 Apr 2020 at 22:35, Robert Elz wrote:
>
> Date:Thu,  9 Apr 2020 22:24:47 + (UTC)
> From:NetBSD Test Fixture 
> Message-ID:  <158647108721.6125.11585167398565454...@babylon5.netbsd.org>
>
>   | This is an automatically generated notice of a NetBSD-current/i386
>   | build failure.
>
> The i386 build remains broken since this commit:
>
>   | 2020.04.09.19.26.38 jdolecek src/sys/arch/xen/xen/xengnt.c,v 1.31
>
> The problem is when compiling i386_PAE and is in the new function:
>
>  static int
>  xengnt_map_status(void)
>
> in this line:
>
>  set_xen_guest_handle(getstatus.frame_list, pages);
>
> which gcc diagnoses as assigning a pointer of one type to a
> pointer of a different type (u_long * being assigned to unsigned long long *).
>
> While I have issues with this kind of thing being treated as an error,
> it looks as if it might be easy to fix.
>
> "pages" is the u_long * - and looks as if it should be a paddr_t *
>
> When it is dereferenced later in the code, the value is cast to a paddr_t
>
> This would perhaps be a real bug, as (as best I understand the x86
> architecture) a u_long is not big enough to hold a paddr_t on 386 PAE.
> (Only perhaps, as it looks as if it is really storing page numbers,
> not paddr_t's at all, but ...)
>
> Changing "pages" to be paddr_t * instead of u_long * seems to fix the
> i386 builds (I also deleted the now redundant cast on the dereference
> in the same function - whether there are more elsewhere I didn't look.)
>
> I am not going to commit this, as I don't understand what is happening
> well enough, and I also haven't tested to confirm that the amd64 builds
> still work with that change made.
>
> Jaromir could you check this, and commit a fix so as to make the i386 builds
> work again?
>
> There was a similar problem in a debug printf earlier, that was "fixed" by
> turning off XEN_DEBUG (so the code isn't being compiled) - but it was a
> related problem, with printf formats: attempting to print a paddr_t as
> a pointer (%p), I believe, in that case (with a cast, but paddr_t's cannot
> be transformed into pointers directly on PAE systems, I think).
>
> kre
>

Re: Panic during reboot on amd64 9.99.52

2020-03-27 Thread Jaromír Doleček

On Fri, 27 Mar 2020 at 20:33, Paul Goyette wrote:
>
> With a current built from sources updated just a few hours ago (on
> 2020-03-27 at 16:13:55 UTC), I get a panic during shutdown.  The
> stack trace doesn't seem to be saved, but I manually transcribed
> it:
>
> vpanic + 0x178
> kern_assert + 0x48
> config_detach + 0x65
> mii_detach + 0x109
> wm_detach + 0xb0
> config_detach + 0xe5
> config_detach_all + 0x97
> cpu_reboot + 0x198
> sys_reboot
> sys_reboot + 0x63
> syscall + 0x299
> (syscall #208)
>
> Unfortunately, the actual panic message had scrolled off the screen,
> but it included "bad device fstate".
>
> The only mii on my machine should be
>
> ihphy0 at wm0 phy 2: i217 10/100/1000 media interface, rev. 5
>
> and is configured as
>
> ihphy0 at mii? phy ?
>
> Anyone got any ideas?

On my machine I see something related, but no panic:

[ 4334.0100134] ihphy0: detached
...
[ 4334.0200130] config_detach: ihphy0 is already detached
[ 4334.0300132] wm0: detached

It seems something changed in the mii detach code; maybe it doesn't
properly run the parent detach hook.

Jaromir

Re: panic after DRM & nouveau MSI changes

2020-02-13 Thread Jaromír Doleček

On Thu, 13 Feb 2020 at 17:28, John D. Baker wrote:
>
> On Wed, 12 Feb 2020, Jaromír Doleček wrote:
>
> > I've just committed a fix for the MSI interrupt allocation for
> > nouveau, can you try it on the system which had the trouble with blank
> > console?
>
> Instead of a blank screen, the display simply freezes in VGA text mode
> when nouveaufb0 attaches.  See my latest addendum to kern/52440.

I see the problem happens on G84 and G92, but not G64.

Are all the cards with the MSI problem by chance in the Tesla family,
i.e. NV50?

Those are: GeForce 8, GeForce 9, GeForce 100, GeForce 200, GeForce 300

Jaromir

Re: panic after DRM & nouveau MSI changes

2020-02-12 Thread Jaromír Doleček

On Wed, 12 Feb 2020 at 03:38, John D. Baker wrote:
> I've checked this with plain GENERIC and one with the above file rolled
> back.  With plain generic, nouveau, intel, and radeon all work and have
> MSI off.  With the modified GENERIC, nouveau systems work (MSI off) and
> intel and radeon systems work (MSI on).

I've just committed a fix for the MSI interrupt allocation for
nouveau, can you try it on the system which had the trouble with blank
console?

It needs rev. 1.7 of
sys/external/bsd/drm2/dist/drm/nouveau/nvkm/subdev/pci/nouveau_nvkm_subdev_pci_base.c

Jaromir

Re: panic after DRM & nouveau MSI changes

2020-02-11 Thread Jaromír Doleček

On Fri, 7 Feb 2020 at 23:22, John D. Baker wrote:
> I see that the MSI changes have been reverted.  I have not yet updated
> my sources and still have the MSI-enabled sources plus proposed nouveau
> patch applied in my kernels.
>
> FWIW, quick tests with known-to-work radeon and intel graphics devices
> show that they are quite happy with MSI.  It's only NVIDIA/nouveau
> devices that are having trouble.

I've now got an NVIDIA card which triggers the problem on my x86 rig;
I will look into the nouveau-specific panic over the next week or so.

If it's something simple, let's fix it and enable MSI unconditionally.
If it turns out to be difficult, it would be reasonable to add some
conditionals to enable MSI only for non-nouveau.

Jaromir

Re: Panic on multiuser boot after dhcpcd with msk(4)

2020-02-06 Thread Jaromír Doleček

On Fri, 7 Feb 2020 at 00:35, Jason Thorpe wrote:
>
>
> > On Feb 7, 2020, at 12:40 AM, Jason Thorpe  wrote:
> >
> > Actually, I just got confirmation from nick that my patch fixes the 
> > problem, so I’ll check it in shortly.
>
> Ok, this should be fixed now.  Please let me know if you encounter any 
> problems with it.

It now panics in wm with LOCKDEBUG:

Starting dhcpcd.
wm0: waiting for carrier
[  11.0863868] Mutex error: mutex_vector_enter,509: assertion failed:
!cpu_intr_p()

[  11.0863868] lock address : 0x8332929eee80 type : sleep/adaptive
[  11.0863868] initialized  : 0x80521618
[  11.0863868] shared holds :  0 exclusive:  0
[  11.0863868] shares wanted:  0 exclusive:  0
[  11.0863868] relevant cpu :  0 last held:  1
[  11.0863868] relevant lwp : 0x833292a15040 last held: 00
[  11.0863868] last locked  : 0x8052189d unlocked*: 0x804ac7a4
[  11.0863868] owner field  : 00 wait/spin:0/0
[  11.0863868] Turnstile: no active turnstile for this lock.

[  11.1618248] panic: LOCKDEBUG: Mutex error: mutex_vector_enter,509:
assertion failed: !cpu_intr_p()
[  11.1618248] cpu0: Begin traceback...
[  11.1618248] vpanic() at netbsd:vpanic+0x146
[  11.1618248] snprintf() at netbsd:snprintf
[  11.1618248] lockdebug_more() at netbsd:lockdebug_more
[  11.1618248] mutex_enter() at netbsd:mutex_enter+0x656
[  11.1618248] workqueue_enqueue() at netbsd:workqueue_enqueue+0x8f
[  11.1618248] if_link_state_change() at netbsd:if_link_state_change+0x132
[  11.1618248] mii_phy_update() at netbsd:mii_phy_update+0x84
[  11.1618248] ihphy_service() at netbsd:ihphy_service+0x9d
[  11.1618248] mii_pollstat() at netbsd:mii_pollstat+0x3e
[  11.1618248] wm_linkintr() at netbsd:wm_linkintr+0x221
[  11.1618248] wm_intr_legacy() at netbsd:wm_intr_legacy+0x12b
[  11.1618248] intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
[  11.1618248] Xhandle_ioapic_edge23() at netbsd:Xhandle_ioapic_edge23+0x6d
[  11.1618248] --- interrupt ---
[  11.1618248] x86_mwait() at netbsd:x86_mwait+0xd
[  11.1618248] acpicpu_cstate_idle_enter() at
netbsd:acpicpu_cstate_idle_enter+0xd1
[  11.1618248] acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xba
[  11.1618248] idle_loop() at netbsd:idle_loop+0x136
[  11.1618248] cpu0: End traceback...
[  11.1618248] fatal breakpoint trap in supervisor mode
[  11.1618248] trap type 1 code 0 rip 0x8021e3a5 cs 0x8 rflags
0x202 cr2 0x73b04ecd7740 ilevel 0x8 rsp 0x8400ac728a28
[  11.1618248] curlwp 0x833292a15040 pid 0.2 lowest kstack
0x8400ac7242c0
Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x5:  leave
db{0}>

Panic on multiuser boot after dhcpcd with msk(4)

2020-02-06 Thread Jaromír Doleček

Hi,

I get a repeatable panic with up-to-date -current when the system goes multiuser:

Adding interface aliases:.
Waiting for DAD to complete for statically configured addresses...
[  10.9328087] wm0: link state UP (was DOWN)
Starting dhcpcd.
[  12.8156398] msk0: link state DOWN (was UNKNOWN)
[  12.8156398] panic: assert_sleepable: softint caller=0x8051119e
[  12.8301938] cpu0: Begin traceback...
[  12.8301938] vpanic() at netbsd:vpanic+0x146
[  12.8361159] snprintf() at netbsd:snprintf
[  12.8361159] assert_sleepable() at netbsd:assert_sleepable+0xbf
[  12.8456876] percpu_foreach() at netbsd:percpu_foreach+0x1f
[  12.8456876] if_stats_to_if_data() at netbsd:if_stats_to_if_data+0x5f
[  12.8579148] if_export_if_data() at netbsd:if_export_if_data+0x15
[  12.8671028] rt_ifmsg() at netbsd:rt_ifmsg+0x68
[  12.8671028] if_link_state_change_softint() at
netbsd:if_link_state_change_softint+0xc0
[  12.8765414] if_link_state_change_si() at netbsd:if_link_state_change_si+0x73
[  12.8876237] softint_dispatch() at netbsd:softint_dispatch+0x345
address 0xce80ac72e0b8 is invalid
address 0xce80ac72e0b0 is invalid
address 0xce80ac72e0c0 is invalid
address 0xce80ac72e0b8 is invalid
address 0xce80ac72e0c8 is invalid
address 0xce80ac72e0c0 is invalid
address 0xce80ac72e0d0 is invalid
address 0xce80ac72e0c8 is invalid
[  12.9158149] DDB lost frame for netbsd:Xsoftintr+0x4f, trying
0xce80ac72dff0
[  12.9258978] Xsoftintr() at netbsd:Xsoftintr+0x4f
[  12.9258978] --- interrupt ---
address 0xce80ac72e0c8 is invalid
address 0xce80ac72e080 is invalid
[  12.9382836] 0:
[  12.9468182] cpu0: End traceback...
[  12.9468182] fatal breakpoint trap in supervisor mode
[  12.9468182] trap type 1 code 0 rip 0x8021e3a5 cs 0x8 rflags
0x202 cr2 0x7ea43f14f000 ilevel 0x6 rsp 0xce80ac72dc50
[  12.9676785] curlwp 0xef76c9615480 pid 0.3 lowest kstack
0xce80ac7292c0
Stopped in pid 0.3 (system) at  netbsd:breakpoint+0x5:  leave
db{0}>

Any ideas what could be wrong? Maybe msk(4) is using some interface wrongly?

Jaromir

Re: ifstats changes not yet applied to plip(4)?

2020-02-04 Thread Jaromír Doleček

This should be fixed in if_plip.c rev. 1.36 now.

Jaromir

On Tue, 4 Feb 2020 at 21:01, John D. Baker wrote:
>
> Following this commit:
>
>   http://mail-index.netbsd.org/source-changes/2020/02/04/msg113679.html
>
> kernels with ppbus(4) and plip(4) still fail, but there are fewer complaints
> about missing "struct ifnet" members:
>
> [...]
> /x/current/src/sys/dev/ppbus/if_plip.c: In function 'lp_intr':
> /x/current/src/sys/dev/ppbus/if_plip.c:639:8: error: 'struct ifnet' has no 
> member named 'if_iqdrops'; did you mean 'if_addrlist'?
>ifp->if_iqdrops++;
> ^~
> if_addrlist
> /x/current/src/sys/dev/ppbus/if_plip.c:646:8: error: 'struct ifnet' has no 
> member named 'if_iqdrops'; did you mean 'if_addrlist'?
>ifp->if_iqdrops++;
> ^~
> if_addrlist
> /x/current/src/sys/dev/ppbus/if_plip.c:650:7: error: 'struct ifnet' has no 
> member named 'if_ipackets'; did you mean 'if_stats'?
>   ifp->if_ipackets++;
>^~~
>if_stats
> /x/current/src/sys/dev/ppbus/if_plip.c:651:7: error: 'struct ifnet' has no 
> member named 'if_ibytes'; did you mean 'if_index'?
>   ifp->if_ibytes += len;
>^
>if_index
> /x/current/src/sys/dev/ppbus/if_plip.c: In function 'lpoutput':
> /x/current/src/sys/dev/ppbus/if_plip.c:727:8: error: 'struct ifnet' has no 
> member named 'if_noproto'; did you mean 'if_softc'?
>ifp->if_noproto++;
> ^~
> if_softc
> *** [if_plip.o] Error code 1
>
>
> Thanks.
>
> --
> |/"\ John D. Baker, KN5UKS   NetBSD Darwin/MacOS X
> |\ / jdbaker[snail]consolidated[flyspeck]net  OpenBSDFreeBSD
> | X  No HTML/proprietary data in email.   BSD just sits there and works!
> |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645

Re: vm.ubc_direct

2019-12-03 Thread Jaromír Doleček

On Tue, 3 Dec 2019 at 18:59, Chuck Silvers wrote:

> On Mon, Dec 02, 2019 at 07:10:52PM +, Andrew Doran wrote:
> > Hello,
> >
> > In light of the recent discussion, and having asked Jaromir his thoughts
> > on the subject, we both think it's time to enable this by default, so it
> > gets wider testing.  Is there a good reason not to?
> >
> > Cheers,
> > Andrew
>
> The current ubc_direct code still has the problem that I pointed out
> originally, which is that it deadlocks if you read() or write() a page
> of a file into a mapping of itself.  We should not enable this by
> default until that problem is fixed.
>

Right, I completely forgot about this.

I have a small program which triggers the deadlock quite reliably. I never
got around to actually adding it to the test suite because it also caused
problems on another system I ran it on.

Andrew, would you by chance be interested in looking at this?

Jaromir

Re: vm.ubc_direct

2019-12-02 Thread Jaromír Doleček

Can you send the dmesg and log of the recoverable I/O errors?

Jaromir

On Wed, 20 Nov 2019 at 15:04, Robert Nestor wrote:

> I tried enabling this option on my amd64 system running a fairly recent
> version of -current off a new SSD.  While building some packages I noticed
> a lot of recoverable disk I/O errors (mainly on writes) on the SSD disk.
> After disabling this option and doing a similar set of package build I
> didn’t see any recoverable disk I/O errors.  I didn’t do any further
> testing, but can if someone wants to look into this and needs additional
> information.

Re: adding support for a possibly unsupported M.2 harddrive?

2019-11-24 Thread Jaromír Doleček

On Sun, 24 Nov 2019 at 12:18, ng0 wrote:

> Hi folx,
>
> I have an M.2 SSD for which I have to assume no support exists so far
> in NetBSD 9.99.17.
> This is an "TREKSTOR M.2 SSD-Modul 64 GB" bought in 2018.
>
> Its dmesg:
>
> [ 3.739718] wd1 at atabus1 drive 0
> [ 3.739718] wd1: <>
> [ 3.739718] wd1: drive supports 1-sector PIO transfers, LBA48
> addressing
> [ 3.739718] wd1: 61057 MB, 124053 cyl, 16 head, 63 sec, 512 bytes/sect
> x 125045424 sectors
> [ 3.739718] wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode
> 6 (Ultra/133), WRITE DMA FUA, NCQ (32 tags)
> [ 3.739718] wd1(ahcisata0:1:0): using PIO mode 4, DMA mode 2,
> Ultra-DMA mode 6 (Ultra/133) (using DMA), NCQ (31 tags)
>
> With fdisk I can see an earlier partition I created on another
> system, but making any changes to partitioning etc pp operations
> on disk fail (I can reproduce the information how it fails).
>

It seems it's configured to attach as AHCI instead of NVMe. Can you check
whether there are any relevant BIOS settings which would make it available
as NVMe?

Nevertheless, even when attached via AHCI it shouldn't give errors. Can you
please post the specific errors from the kernel when you try to do the
partitioning?

Jaromir

Re: Increases in build system time

2019-11-14 Thread Jaromír Doleček

On Thu, 14 Nov 2019 at 21:41, Christos Zoulas wrote:

> In article <24013.43646.552099.15...@guava.gson.org>,
> Andreas Gustafsson   wrote:
> >
> >Hi all,
> >
> >Back in September, I wrote:
>
> >  12% increase:
> >
> >2019.03.08.20.35.10 christos src/share/mk/bsd.own.mk 1.1108
> >
> >Back to using jemalloc for x86_64; all problems have been resolved.
>
> Indeed I would expect the new jemalloc to do the same or better not
> so much worse. Perhaps it has to do with TLS? Or some poor tuning/default?
> I will look into it.
>

I wonder also if we could try enabling vm.ubc_direct on the build machine?

Jaromir

Re: tar extract changed since netbsd-8? (extracting sets over running system)

2019-11-12 Thread Jaromír Doleček

On Tue, 12 Nov 2019 at 12:05, Martin Husemann wrote:

> Not seen this locally, but that would be the switch to bsd/libarchive tar

Re: firefox52 quits on netbsd-9 w/i915drmkmsfb

2019-11-05 Thread Jaromír Doleček

On Tue, 5 Nov 2019 at 20:14, John D. Baker wrote:

> [...]
> i965: Failed to submit batchbuffer: Input/output error
> assertion "pthread__tsd_destructors[key] != NULL" failed: file
> "/x/netbsd-9/src/lib/libpthread/pthread_tsd.c", line 177, function
> "pthread__add_specific"
>
>
It seems like something called pthread_key_delete() while/before the
pthread_setspecific() call.

I believe the code there should be changed to simply return without doing
anything if the specified key doesn't exist, and the assert() removed.
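
A minimal sketch of that idea, using a toy key table rather than the
real libpthread internals. Every sketch_* name is hypothetical, the
table is not thread-safe, and returning EINVAL for a missing key is an
assumption (POSIX permits pthread_setspecific() to fail that way):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_KEYS_MAX 8

/* Hypothetical per-process table: a NULL slot means "no such key". */
static void (*sketch_destructors[SKETCH_KEYS_MAX])(void *);
/* Hypothetical per-thread values, one slot per key. */
static void *sketch_values[SKETCH_KEYS_MAX];

static void
sketch_noop(void *v)
{
	(void)v;
}

/* Register a key; a no-op destructor marks the slot as in use. */
static void
sketch_key_create(unsigned key, void (*dtor)(void *))
{
	assert(key < SKETCH_KEYS_MAX);
	sketch_destructors[key] = (dtor != NULL) ? dtor : sketch_noop;
}

/* Drop a key and forget its value. */
static void
sketch_key_delete(unsigned key)
{
	assert(key < SKETCH_KEYS_MAX);
	sketch_destructors[key] = NULL;
	sketch_values[key] = NULL;
}

/* Store a value; for a missing key change nothing and report EINVAL. */
static int
sketch_add_specific(unsigned key, void *value)
{
	if (key >= SKETCH_KEYS_MAX || sketch_destructors[key] == NULL)
		return EINVAL;	/* previously this would have asserted */
	sketch_values[key] = value;
	return 0;
}
```

With this shape, a set after a delete becomes a harmless error for the
caller instead of an abort of the whole process.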

Jaromir

Re: mac68k kern-INSTALL vs GCC7?

2019-02-15 Thread Jaromír Doleček

On Fri, 15 Feb 2019 at 17:33, John D. Baker wrote:
>
> Building for mac68k with -V HAVE_GCC=7 produces the following error:
>
> /x/current/src/sys/arch/mac68k/mac68k/intr.c:135:2: note: in expansion of 
> macro 'memcpy'
>   memcpy(g_inames, inames, MAX_INAME_LENGTH);
>   ^~
> cc1: all warnings being treated as errors
> *** [intr.o] Error code 1
> nbmake[2]: stopped in 
> /r0/build/current/obj/mac68k/sys/arch/mac68k/compile/INSTALL
> 1 error
> nbmake[2]: stopped in 
> /r0/build/current/obj/mac68k/sys/arch/mac68k/compile/INSTALL
> [...]

The code there boils down to memcpy(, "somestring", 53);
while that is actually fine (upon inspection), it is very difficult
to parse, and the compiler is right to warn.
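
For illustration only, and assuming the warning is about a fixed copy
length that is larger than the apparent size of the source: a bounded
copy that never reads past the terminating NUL of the source looks
roughly like this. copy_name() and NAME_BUF_LEN are hypothetical names,
not the identifiers used in intr.c, and this is not necessarily what
the diff linked below does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_BUF_LEN 53	/* fixed destination size, as in the call above */

/*
 * Copy at most dstlen - 1 bytes of src into dst and zero-fill the rest.
 * Only strlen(src) bytes of the source are ever read.
 */
static void
copy_name(char *dst, size_t dstlen, const char *src)
{
	size_t n;

	assert(dstlen > 0);
	n = strlen(src);
	if (n >= dstlen)
		n = dstlen - 1;		/* truncate, keep room for the NUL */
	memcpy(dst, src, n);
	memset(dst + n, 0, dstlen - n);	/* terminate and pad */
}
```

The destination still ends up fully initialised, which is what a
fixed-length memcpy() from a sufficiently large source would have
achieved.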

Maybe something like this?

https://www.netbsd.org/~jdolecek/mac68k_intr_gcc7.diff

Jaromir

Re: Recent USB changes broke kernel memory allocation

2019-02-10 Thread Jaromír Doleček

Fixed now. If you update the tree to have sys/dev/usb/umass.c rev.
1.174 you'll get the fixed files.
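
As background on the panic quoted below ("kmem_free(..., 8) != allocated
size 472"): the check compares the size passed on free with the size
recorded at allocation time. The following userland toy shows that kind
of check. It is not the kernel's kmem code, all names are hypothetical,
and the sizeof-of-a-pointer mistake is only one plausible way to arrive
at a size of 8; the actual cause in umass.c is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header kept in front of every allocation. */
union sized_hdr {
	size_t size;		/* size requested at allocation time */
	max_align_t align;	/* keeps the payload suitably aligned */
};

static void *
sized_alloc(size_t size)
{
	union sized_hdr *h = malloc(sizeof(*h) + size);

	if (h == NULL)
		return NULL;
	h->size = size;
	return h + 1;
}

/* Returns false and frees nothing on a mismatch; the kernel panics. */
static bool
sized_free(void *p, size_t size)
{
	union sized_hdr *h = (union sized_hdr *)p - 1;

	if (h->size != size)
		return false;
	free(h);
	return true;
}

/* Stand-in for a 472-byte driver structure. */
struct softc_sketch {
	char pad[472];
};
```

Freeing with sizeof(sc) (the pointer) is rejected, while sizeof(*sc)
(the structure) matches the recorded size and succeeds.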

Jaromir

On Sun, 10 Feb 2019 at 19:31, Tom Ivar Helbekkmo wrote:
>
> It seems that changes made to USB code on February 7th broke the kernel
> memory allocation arena.  After that point, it is enough to insert a USB
> memory stick into my amd64 laptop, and then remove it, to make the
> kernel crash.  It seems the changes to the allocating and freeing calls
> got a bit messed up, leading to internal disagreements about item sizes,
> at least in the umass code:
>
> : dejah# ;cd /var/crash
> : dejah# ;dmesg -N netbsd.26 -M netbsd.26.core | tail -23
> [  1525.390177] umass0: SMI Corporation (0x90c) USB DISK (0x1000), rev 
> 2.00/11.00, addr 2
> [  1525.390177] umass0: using SCSI over Bulk-Only
> [  1525.390177] scsibus0 at umass0: 2 targets, 1 lun per target
> [  1525.660323] sd0 at scsibus0 target 0 lun 0:  
> disk removable
> [  1525.660323] sd0: 3864 MB, 7872 cyl, 16 head, 63 sec, 512 bytes/sect x 
> 7913472 sectors
> [  1537.266612] sd0: detached
> [  1537.266612] scsibus0: detached
> [  1537.266612] panic: kmem_free(0x8412b3188208, 8) != allocated size 472
> [  1537.266612] cpu1: Begin traceback...
> [  1537.266612] vpanic() at netbsd:vpanic+0x16f
> [  1537.266612] snprintf() at netbsd:snprintf
> [  1537.266612] kmem_alloc() at netbsd:kmem_alloc
> [  1537.266612] umass_detach() at netbsd:umass_detach+0xe1
> [  1537.266612] config_detach() at netbsd:config_detach+0x121
> [  1537.266612] usb_disconnect_port() at netbsd:usb_disconnect_port+0xb8
> [  1537.266612] uhub_explore() at netbsd:uhub_explore+0x221
> [  1537.266612] usb_discover.isra.2() at netbsd:usb_discover.isra.2+0x68
> [  1537.266612] usb_event_thread() at netbsd:usb_event_thread+0x77
> [  1537.266612] cpu1: End traceback...
>
> [  1537.266612] dumping to dev 0,1 (offset=1472, size=1045482):
> [  1537.266612] dump
> : dejah# ;gdb netbsd.gdb
> GNU gdb (GDB) 8.0.1
> Copyright (C) 2017 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64--netbsd".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> .
> Find the GDB manual and other documentation resources online at:
> .
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from netbsd.gdb...done.
> (gdb) target kvm netbsd.26.core
> 0x80222d75 in cpu_reboot (howto=howto@entry=260, 
> bootstr=bootstr@entry=0x0)
> at /usr/src/sys/arch/amd64/amd64/machdep.c:726
> 726 dumpsys();
> (gdb) bt
> #0  0x80222d75 in cpu_reboot (howto=howto@entry=260, 
> bootstr=bootstr@entry=0x0)
> at /usr/src/sys/arch/amd64/amd64/machdep.c:726
> #1  0x809ec2c7 in vpanic (fmt=fmt@entry=0x813f8838 
> "kmem_free(%p, %zu) != allocated size %zu",
> ap=ap@entry=0x84806a1d5d78) at /usr/src/sys/kern/subr_prf.c:335
> #2  0x809ec35e in panic (fmt=fmt@entry=0x813f8838 
> "kmem_free(%p, %zu) != allocated size %zu")
> at /usr/src/sys/kern/subr_prf.c:254
> #3  0x809e1944 in kmem_size_check (sz=8, p=0x8412b3188200) at 
> /usr/src/sys/kern/subr_kmem.c:549
> #4  kmem_intr_free (p=0x8412b3188200, requested_size=8) at 
> /usr/src/sys/kern/subr_kmem.c:337
> #5  0x8047d794 in umass_detach (self=, flags=1) at 
> /usr/src/sys/dev/usb/umass.c:844
> #6  0x809d337b in config_detach (dev=dev@entry=0x8412a6f78908, 
> flags=flags@entry=1)
> at /usr/src/sys/kern/subr_autoconf.c:1748
> #7  0x804697df in usb_disconnect_port 
> (up=up@entry=0x84129e303210, parent=,
> flags=flags@entry=1) at /usr/src/sys/dev/usb/usb_subr.c:1665
> #8  0x8046a3a2 in uhub_explore (dev=0x84129e2fae20) at 
> /usr/src/sys/dev/usb/uhub.c:637
> #9  0x80463e47 in usb_discover (sc=, sc= out>) at /usr/src/sys/dev/usb/usb.c:1004
> #10 0x80463f0e in usb_event_thread (arg=0x84129e16bf68) at 
> /usr/src/sys/dev/usb/usb.c:562
> #11 0x802097c7 in lwp_trampoline ()
> #12 0x in ?? ()
> (gdb) up
> #1  0x809ec2c7 in vpanic (fmt=fmt@entry=0x813f8838 
> "kmem_free(%p, %zu) != allocated size %zu",
> ap=ap@entry=0x84806a1d5d78) at /usr/src/sys/kern/subr_prf.c:335
> 335 cpu_reboot(bootopt, NULL);
> (gdb) up
> #2  0x809ec35e in panic (fmt=fmt@entry=0x813f8838 
> "kmem_free(%p, %zu) != allocated size %zu")
> at /usr/src/sys/kern/subr_prf.c:254
> 254 vpanic(fmt, ap);
> (gdb) up
> #3  0x809e1944 in kmem_size_check (sz=8, p=0x8412b3188200) at 
>

Re: MSI/MSI-X implementation and interrupt handling on i386/amd64

2018-12-11 Thread Jaromír Doleček

Moving this to port-amd64 (bcced current-users@ for reference)

On Tue, 11 Dec 2018 at 04:34, Kengo NAKAHARA wrote:
> I mention some old Athlon 64 series (before socket AM2) do not support
> cmpxchg16b instruction. That would affect rewriting spllower to support
> 64 bit interrupt bitmask.

Indeed, we need a runtime check for cmpxchg16b support.

I'm investigating an initial solution which keeps ipending+ilevel
together as a single 64-bit quantity suitable for cmpxchg8b, using
4 bits for the ilevel and the remaining 60 for ipending.

The 60/4 split with cmpxchg8b should work for i386 too. Besides
spl.S/vector.S, so far I have found only one place doing an atomic write
to ipending: clearing the interrupt from ipending in intr disestablish.
That one can easily be changed to just block all interrupts during the
clearing, since it is not a performance-critical path.
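
A minimal sketch of the 60/4 split, assuming the low 4 bits hold ilevel
and the upper 60 bits hold ipending (the bit order is an assumption, as
are all the istate_* names). It uses C11 atomics in place of
hand-written cmpxchg8b and is not the spl.S code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ILEVEL_BITS	4
#define ILEVEL_MASK	((UINT64_C(1) << ILEVEL_BITS) - 1)
#define IPENDING_BITS	60

/* Combine a 60-bit pending mask and a 4-bit level into one word. */
static inline uint64_t
istate_pack(uint64_t ipending, unsigned ilevel)
{
	assert(ilevel <= ILEVEL_MASK);
	assert((ipending >> IPENDING_BITS) == 0);
	return (ipending << ILEVEL_BITS) | ilevel;
}

static inline unsigned
istate_ilevel(uint64_t s)
{
	return (unsigned)(s & ILEVEL_MASK);
}

static inline uint64_t
istate_ipending(uint64_t s)
{
	return s >> ILEVEL_BITS;
}

/* Mark one interrupt slot (0..59) pending without touching ilevel. */
static inline void
istate_set_pending(_Atomic uint64_t *state, unsigned slot)
{
	assert(slot < IPENDING_BITS);
	atomic_fetch_or(state, UINT64_C(1) << (slot + ILEVEL_BITS));
}

/*
 * Lower the level to nlevel only if nothing in 'unmasked' is pending.
 * Returns false when pending work has to be processed first.
 */
static inline bool
istate_try_lower(_Atomic uint64_t *state, unsigned nlevel, uint64_t unmasked)
{
	uint64_t old = atomic_load(state);

	for (;;) {
		if (istate_ipending(old) & unmasked)
			return false;
		uint64_t new = istate_pack(istate_ipending(old), nlevel);
		/* On failure 'old' is reloaded and the check is repeated. */
		if (atomic_compare_exchange_weak(state, &old, new))
			return true;
	}
}
```

Keeping both fields in one word is what lets the "check pending, then
lower the level" step be a single compare-and-swap, so an interrupt
arriving in between cannot be missed.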

Jaromir

Re: MSI/MSI-X implementation and interrupt handling on i386/amd64

2018-12-10 Thread Jaromír Doleček

On Thu, 6 Dec 2018 at 16:05, Cherry G.Mathew wrote:
> The right thing to do is to stop using a bit mask entirely, and using
> a bit more scalable Data structure for this. This isn't trivial though -
> the assembler stuff will be harder to maintain correctness than a
> straightup buslocked bitscan/compare etc.

What about just bumping this to 64 on amd64, where we have the 64-bit
atomic ops, while keeping i386 on 32?

We already seem to have separate i386 and amd64 variants of the
interrupt assembler, so it may not be so bad if they diverge further.

It would be nice to do something to bump the limit. If there is general
consensus that this is worth doing, I can try to write something and
see how ugly/difficult it becomes with 64-bit bitmasks. I don't feel
like delving into rewriting this to use a completely different
structure ...

Jaromir

Re: apu2 SATA patch

2018-11-26 Thread Jaromír Doleček

Committed this.
On Mon, 26 Nov 2018 at 22:28, Mike Pumford wrote:
>
>
>
> On 26/11/2018 15:16, Greg Troxel wrote:
> > Mike Pumford  writes:
> >
> >> I have one of these. The msata needs a small patch (needs an
> >> entry in the quirks table to be properly recognised as an ahci
> >> controller) but other than that it seems to work. No stability issues
> >> using sdhc as the system disk.
> >
> > Could you mail a patch, or file a PR with it?  This seems like something
> > that should be applied in the main tree.
> >
> Here's the patch. I've not verified IO yet but the patch is based on
> how OpenBSD and FreeBSD handle the device and the detection messages
> match what they report.
>

Re: Panic in ahci_detach

2018-11-02 Thread Jaromír Doleček

On Thu, 1 Nov 2018 at 06:38, Masanobu SAITOH wrote:
> The meaning of atac_nchannels changed or numbering of channel
> changed?

ahci_detach() counted improperly. Can you confirm rev. 1.66 of
dev/ic/ahcisata_core.c fixes the problem?

Jaromir

Re: Panic in ahci_detach

2018-11-01 Thread Jaromír Doleček

On Thu, 1 Nov 2018 at 06:38, Masanobu SAITOH wrote:
> The meaning of atac_nchannels changed or numbering of channel
> changed?

No, it hasn't, but ahcisata(4) did not call the detach routine
until about a week ago, so the logic might actually be buggy.

I'll recheck it, stay tuned.

Jaromir

Re: Build fails for kernels w/cd(4) but w/o wd(4)

2018-10-26 Thread Jaromír Doleček

On Sat, 27 Oct 2018 at 00:50,  wrote:
>
> On Fri, Oct 26, 2018 at 05:27:05PM -0500, John D. Baker wrote:
> > --- wdc.o ---
> > /x/current/src/sys/dev/ic/wdc.c:138:1: error: missing initializer for field 
> > 'ata_recovery' of 'const struct ata_bustype' 
> > [-Werror=missing-field-initializers]
> >  };
> >  ^
> > In file included from /x/current/src/sys/dev/ic/wdc.c:90:0:
> > /x/current/src/sys/dev/ata/atavar.h:376:9: note: 'ata_recovery' declared 
> > here
> >   void (*ata_recovery)(struct ata_channel *, int, uint32_t);
> >  ^~~~
>
> I assume the fix is this, to set it to NULL like it would be in the NWD > 0 
> case.
> I've converted it to a C99 initializer while there.

Yes. Please commit this fix, I'm away for the next couple of days.

Jaromir

> Index: dev/ic/wdc.c
> ===
> RCS file: /cvsroot/src/sys/dev/ic/wdc.c,v
> retrieving revision 1.289
> diff -u -r1.289 wdc.c
> --- dev/ic/wdc.c22 Oct 2018 20:13:47 -  1.289
> +++ dev/ic/wdc.c26 Oct 2018 22:38:35 -
> @@ -126,15 +126,16 @@
>  #else
>  /* A fake one, the autoconfig will print "wd at foo ... not configured */
>  const struct ata_bustype wdc_ata_bustype = {
> -   SCSIPI_BUSTYPE_ATA,
> -   NULL,   /* wdc_ata_bio */
> -   NULL,   /* wdc_reset_drive */
> -   wdc_reset_channel,
> -   wdc_exec_command,
> -   NULL,   /* ata_get_params */
> -   NULL,   /* wdc_ata_addref */
> -   NULL,   /* wdc_ata_delref */
> -   NULL    /* ata_kill_pending */
> +   .bus_type = SCSIPI_BUSTYPE_ATA,
> +   .ata_bio =  NULL,
> +   .wdc_reset_drive =  NULL,
> +   .ata_reset_channel =wdc_reset_channel,
> +   .ata_exec_command = wdc_exec_command,
> +   .ata_get_params =   NULL,
> +   .ata_addref =   NULL,
> +   .ata_delref =   NULL,
> +   .ata_killpending =  NULL,
> +   .ata_recovery = NULL,
>  };
>  #endif
>
>

Re: Xen Domu kernel crash at start of boot

2018-06-22 Thread Jaromír Doleček

2018-06-22 2:54 GMT+02:00 Chuck Zmudzinski :
> I am getting a kernel crash almost immediately after booting the current
> kernel. I am running NetBSD/xen amd64 on a Debian Linux 8.10 DOM0 which uses
> Xen-4.4. Last week's kernel was good. I built a kernel from a cvs update a
> couple of days ago and tried it. It crashed. I tried the most recent daily
> snapshot available from NetBSD daily builds. It crashed too. Here is the
> information from the console about the daily snapshot kernel that crashed
> (it was built earlier today):
> [   1.000] vcpu0: Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz, id 0x306c3
> [   1.000] vcpu0: package 0, core 3, smt 0
> [   1.000] vcpu1 at hypervisor0
> [   1.000] vcpu1: Intel(R) Core(TM) i5-4590S CPU @ 3.00GHz, id 0x306c3
> [   1.000] vcpu1: package 0, core 3, smt 0
> [   1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface
> [   1.000] xencons0 at hypervisor0: Xen Virtual Console Driver
> [   1.030] fatal protection fault in supervisor mode
> [   1.030] trap type 4 code 0 rip 0x80205968 cs 0x1e030
> rflags 0x10046 cr2 0 ilevel 0 rsp 0xa000a570fbf0
> [   1.030] curlwp 0xa642d4a0 pid 0.15 lowest kstack
> 0xa000a570b2c0
> kernel: protection fault trap, code=0
> Stopped in pid 0.15 (system) at 80205968:   fxsavel

That would be my fault.

Can you send me the moral equivalent of "cpuctl identify 0" from the DOM0?
I want to know what CPUID is saying about supported features on the
CPU.

Can you also check whether you use the no-xsave flag for your DOM0 by
chance? It should not be needed on Intel CPUs.

Meanwhile this change in sys/arch/xen/x86/cpu.c can be done to avoid this:

@@ -551,7 +551,7 @@ cpu_init(struct cpu_info *ci)
  * does, here we only set CR4_OSXSAVE if the feature is already
  * enabled according to CPUID.
  */
- if (cpu_feature[1] & CPUID2_OSXSAVE)
+ if (0 && cpu_feature[1] & CPUID2_OSXSAVE)
  cr4 |= CR4_OSXSAVE;
  else {
  x86_xsave_features = 0;

The change was tested on Xen 4.2 and Xen 4.8; I wonder if Xen 4.4 has
yet another quirk. Any chance you could try with your DOM0 updated to
a newer Xen?

Jaromir

Re: Remove fortune quotes attributed to or providing admiration of Adolf Hitler [pr bin/52735]

2017-11-19 Thread Jaromír Doleček

I very strongly object to anything appeasing any SJW or PC trolls,
and I'm against removing those quotes.

History needs to be remembered, and learnt from. Facts need to be told and
faced. It is a great threat to our modern society that certain groups of
people today are so intent on silencing dissent or uncomfortable voices.
Certainly that trend to remove free speech and move towards a totalitarian
society is a much bigger present danger than quoting these long dead people.

Jaromir

2017-11-18 15:51 GMT+01:00 Andy Ruhl :

> On Sat, Nov 18, 2017 at 5:41 AM, Chavdar Ivanov  wrote:
> > even if it is perhaps a proper quote, but is worth remembering and
> > reminding people.
>
> At the risk of not being politically correct, I agree.
>
> The world seems intent on not remembering all history and even
> changing parts of it to suit today's sensibilities. Which seems
> dangerous to me. "All of it" (history) is "how we got here" and should
> be learned from.
>
> I'm not necessarily against removing stuff that is offensive to more
> than a few people. But if there is value in remembering it to put it
> into modern context, then we should think about it a bit.
>
> Andy
>

Re: dump -X of large LVM based FFSv2 with WAPBL panics

2017-11-15 Thread Jaromír Doleček

Hi,

can you check whether doing a full forced fsck (fsck -f) resolves this?

I've seen several such persistent panics when I was debugging WAPBL. Even
after kernel fixes I had persistent panics around ffs_newvnode() due to
disk data corruption from previous runs. This is worth trying.

Some day I plan to add a counter, so that boot would force an fsck
every X boots even when the filesystem is clean, similar to what Linux
does with ext3/4.

Jaromir

2017-11-15 12:56 GMT+01:00 Matthias Petermann :

> Hello,
>
> on my system I have observed a serious panic when doing FFSv2 dumps under
> certain conditions. I did some googling on my own and found some references
> regarding the lead symptom
>
> "ffs_newvnode: ino=113 on /p: gen 55fd2f1f/55fd2f1f has non zero
> blocks ff00 or size 0"
>
> but all of them ended up as solved back in 2016. So I wanted to share my
> observation here, in the hope somebody can give me some pointers how the
> issue could be narrowed down further.
>
> 1) Given:
>
> - NetBSD 8.0_BETA (Kernel built from branches/netbsd-8 around 2017-11-06)
>
> NetBSD nuc.local 8.0_BETA NetBSD 8.0_BETA (XEN3_DOM0_XHCI) #0: Mon
> Nov 6 14:31:17 CET 2017 
> admin@nuc.local:/s/src/sys/arch/amd64/compile/XEN3_DOM0_XHCI
> amd64
>
> - A large (392 GB) LVM volume hosting a FFSv2 filesystem with WAPBL enabled
>   (/dev/mapper/vg0-photo mounted at /p)
>
> - (An external USB 3.0 Drive)
>
> 2) What I tried:
>
> - make a dump of the aforementioned filesystem, using snapshots
>
> # dump -X -0auf /mnt/photo.0.dump /p
>
> 3) What happens then:
>
> - the System crashes, leaving a coredump with the following
> indication:
>
> ffs_newvnode: ino=113 on /p: gen 55fd2f1f/55fd2f1f has non zero blocks
> ff00 or size 0
> fatal page fault in supervisor mode
> trap type 6 code 0x2 rip 0x8022c0cc cs 0x8 rflags 0x10246 cr2
> 0xfe82deaddf1d ilevel 0x3 rsp 0xfe810e6b1eb8
> curlwp 0xfe827f736000 pid 0.4 lowest kstack 0xfe810e6ae2c0
> panic: trap
> cpu0: Begin traceback...
> vpanic() at netbsd:vpanic+0x140
> snprintf() at netbsd:snprintf
> trap() at netbsd:trap+0xc6b
> --- trap (number 6) ---
> mutex_enter() at netbsd:mutex_enter+0xc
> biodone2() at netbsd:biodone2+0x9b
> biodone2() at netbsd:biodone2+0x9b
> biointr() at netbsd:biointr+0x3a
> softint_dispatch() at netbsd:softint_dispatch+0xd3
> DDB lost frame for netbsd:Xsoftintr+0x4f, trying 0xfe810e6b1ff0
> Xsoftintr() at netbsd:Xsoftintr+0x4f
> --- interrupt ---
> 0:
> cpu0: End traceback...
>
> dumping to dev 0,1 (offset=168119, size=2076255):
> dump
>
> - gdb backtrace shows:
>
> (gdb) target kvm netbsd.3.core
> 0x80229545 in cpu_reboot ()
> (gdb) bt
> #0  0x80229545 in cpu_reboot ()
> #1  0x809a4afc in vpanic ()
> #2  0x809a4bb0 in panic ()
> #3  0x8022b176 in trap ()
> #4  0x8020113e in alltraps ()
> #5  0x8022c0cc in mutex_enter ()
> #6  0x80a029f5 in wapbl_biodone ()
> #7  0x809e2f20 in biodone2 ()
> #8  0x809e2f20 in biodone2 ()
> #9  0x809e303e in biointr ()
> #10 0x8097bc1d in softint_dispatch ()
> #11 0x80223eef in Xsoftintr ()
> (gdb)
>
> 4) What I tried afterwards:
>
> - make a dump of the aforementioned filesystem, using NO snapshots
>
> # dump -0auf /mnt/photo.0.dump /p
>
> -> works
>
> - umount the filesystem, enforcing a manual fsck
>
> -> no problems
>
> - dumpfs -s /dev/mapper/vg0-photo
>
> nuc# dumpfs -s /dev/mapper/vg0-photo
> file system: /dev/mapper/vg0-photo
> format  FFSv2
> endian  little-endian
> location 65536  (-b 128)
> magic   19540119        time    Wed Nov 15 12:26:52 2017
> superblock location 65536   id  [ 59f8026a 16319237 ]
> cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
> nbfree  4461561 ndir    1865    nifree  24770027        nffree  2079
> ncg     530     size    100663296       blocks  99102949
> bsize   32768   shift   15      mask    0x8000
> fsize   4096    shift   12      mask    0xf000
> frag    8       shift   3       fsbtodb 3
> bpg     23742   fpg     189936  ipg     46848
> minfree 5%      optim   time    maxcontig 2     maxbpg  4096
> symlinklen 120  contigsumsize 2
> maxfilesize 0x000800800805
> nindir  4096    inopb   128
> avgfilesize 16384   avgfpdir 64
> sblkno  24      cblkno  32      iblkno  40      dblkno  2968
> sbsize  4096    cgsize  32768
> csaddr  2968    cssize  12288
> cgrotor 0       fmod    0       ronly   0       clean   0x01
> wapbl version 0x1   location 2  flags 0x0
> wapbl loc0 402688128    loc1 131072     loc2 512        loc3 3
> flags   none
> fsmnt   /p
> volname         swuid   0
>
> 5) Further

Re: netbsd 8 (beta) failing to load ixg device

2017-11-13 Thread Jaromír Doleček

I had a very brief look at the crashing function
ixgbe_update_stats_count(). The only division there is the one using
adapter->num_queue.

Looking at ixgbe_configure_interrups(), it seems that the value can happily
be set to 0 if the number of MSI vectors is 1, as is the case according to
your dmesg.

SAITOH Masanobu, could you please have a closer look? I wonder if this
could have been introduced around rev. 1.96/1.97/1.98 of
dev/pci/ixgbe/ixgbe.c, that's when the code related to this changed.

Jaromir

2017-11-13 16:03 GMT+01:00 Derrick Lobo :
>
> Hi Thor
>
> I have attached the logs. I am able to use the server in 7.99 but cannot
> upgrade to 8.0 beta. There's a kernel panic when the driver is loaded; it's
> not as if the boot progresses and marks the driver as unconfigured.
>
> -Original Message-
> From: Thor Lancelot Simon [mailto:t...@panix.com]
> Sent: Sunday, November 12, 2017 10:30 PM
> To: Derrick Lobo
> Cc: port-am...@netbsd.org; current-users@netbsd.org
> Subject: Re: netbsd 8 (beta) failing to load ixg device
>
> On Thu, Nov 09, 2017 at 09:15:53AM -0500, Derrick Lobo wrote:
> > The daily beta version of NetBSD 8 does not support ixg 5gb NICs; the
> > support was enabled in 7.99
>
> That doesn't make sense - if it's in 7.anything, it's in 8.  When we cut
> the 8 branch, we move the version number on HEAD to 8.99.
>
> Thor

Re: dmesg spam: ahcisata0 port 1: active 2 is 0x40000001 tfd 0x2051

2017-11-12 Thread Jaromír Doleček

Okay, so it is triggered by an ATAPI command.

I was concerned that the code might mishandle ATAPI failures, but the
recovery code seems to be okay as far as I can read it.

I've changed the message to not show unless debugging, so it should not
spam you again.

Jaromir

2017-11-10 23:03 GMT+01:00 Stefan Hertenberger <stefan@hertenberger.bayern>:

> On Fri, 10 Nov 2017 18:43:19 +0100,
> Jaromír Doleček <jaromir.dole...@gmail.com> wrote:
>
> > TFES usually means an unsupported command was sent to the device, or a
> > command was sent before the previous one had finished.
> >
> > Could you please try to figure out what is the command sent to the cd0
> > device just before that error? It should be possible to turn the debug
> > messages on via the ahcdebug_mask variable in ddb, set it to
> > DEBUG_XFERS (0x02).
>
> Hope this helps
>
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 16384 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0 port 0 header 0x80008e01f020
> ahci_bio_complete bcount 32768 now 0
> ahci_ata_bio port 0 CI 0x0
> ahci_bio_start CI 0x0
> ahcisata0 port 0 tbl 0x80008e021200
> ahcisata0

Re: dmesg spam: ahcisata0 port 1: active 2 is 0x40000001 tfd 0x2051

2017-11-12 Thread Jaromír Doleček

TFES usually means an unsupported command was sent to the device, or a
command was sent before the previous one had finished.

Could you please try to figure out what is the command sent to the cd0
device just before that error? It should be possible to turn the debug
messages on via the ahcdebug_mask variable in ddb, set it to DEBUG_XFERS
(0x02).

It could also be an asynchronous notification from the device, which gets
mishandled as an error. I see that both your device and Martin's support them
(SNTF capability). If there is no preceding command, it could be this.

Jaromir

2017-11-10 14:21 GMT+01:00 Stefan Hertenberger :

>
>
> On 10 November 2017 14:16:51 CET, Martin Husemann <
> mar...@duskware.de> wrote:
>>
>> On Fri, Nov 10, 2017 at 02:11:12PM +0100, Jaromír Doleček wrote:
>>
>>>  The TFD 0x2051 corresponds to Media Changed error, with DRDY/ERR status
>>>  bits. Is it possible this message appears every time you change media in
>>>  the cd0?
>>>
>>>  Alternatively, I imagine this could be generated if the drive is setup to
>>>  hibernate/suspend.
>>>
>>
>> For me it happens w/o any medium and never touching the drive.
>>
>> Martin
>>
>>
> For me too, starts while booting w/o interaction from me.
>
> This message was sent from my Android device with K-9 Mail

Re: dmesg spam: ahcisata0 port 1: active 2 is 0x40000001 tfd 0x2051

2017-11-12 Thread Jaromír Doleček

The TFD 0x2051 corresponds to Media Changed error, with DRDY/ERR status
bits. Is it possible this message appears every time you change media in
the cd0?

Alternatively, I imagine this could be generated if the drive is set up to
hibernate/suspend.
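
As a rough sketch of how the values from the subject line decode: the macro
and function names below are made up for this example, and the bit positions
assume the usual AHCI port register layout (PxTFD carries the ATA status in
bits 7:0 and the error register in bits 15:8; PxIS bit 30 is TFES) together
with the classic ATA status/error bit assignments.

```c
#include <assert.h>
#include <stdint.h>

/* ATA status register bits (names are illustrative) */
#define STS_ERR		0x01	/* error */
#define STS_DSC		0x10	/* seek complete */
#define STS_DRDY	0x40	/* device ready */

/* ATA error register bit */
#define ERR_MC		0x20	/* media changed */

/* AHCI port interrupt status bits */
#define IS_DHRS		(1u << 0)	/* device-to-host register FIS */
#define IS_TFES		(1u << 30)	/* task file error status */

/* Low byte of PxTFD: copy of the ATA status register */
static uint8_t
tfd_status(uint32_t tfd)
{
	return (uint8_t)(tfd & 0xff);
}

/* Second byte of PxTFD: copy of the ATA error register */
static uint8_t
tfd_error(uint32_t tfd)
{
	return (uint8_t)((tfd >> 8) & 0xff);
}
```

Under those assumptions, tfd 0x2051 splits into status 0x51 (DRDY, DSC and
ERR set) and error 0x20 (MC), and "is 0x40000001" has both TFES and DHRS set.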

Jaromir

2017-11-10 10:49 GMT+01:00 :

> On 2017-11-10 10:22, Martin Husemann wrote:
>
>> On Fri, Nov 10, 2017 at 10:11:43AM +0100, stefan@hertenberger.bayern
>> wrote:
>>
>>> On 2017-11-10 09:57, Martin Husemann wrote:
>>> > Can you show the dmesg part describing your ahcisata devices?
>>>
>>> ahcisata0 at pci0 dev 31 function 2: vendor 8086 product 1c03 (rev. 0x04)
>>> ahcisata0: interrupting at ioapic0 pin 19
>>> ahcisata0: 64-bit DMA
>>> ahcisata0: AHCI revision 1.30, 6 ports, 32 slots, CAP
>>> 0xef30ff45>> SSNTF,SNCQ,S64A>
>>> atabus0 at ahcisata0 channel 0
>>> atabus1 at ahcisata0 channel 1
>>>
>>
>> And the device at atabus1?
>>
>> Martin
>>
>
> The complete dmesg output was attached to the first mail.
>
> Regards Stefan
>
> Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
> 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017
> The NetBSD Foundation, Inc.  All rights reserved.
> Copyright (c) 1982, 1986, 1989, 1991, 1993
> The Regents of the University of California.  All rights reserved.
>
> NetBSD 8.99.6 (GENERIC) #8: Thu Nov  9 23:16:37 UTC 2017
> root@packard:/usr/build/obj/sys/arch/amd64/compile/GENERIC
> total memory = 8043 MB
> avail memory = 7788 MB
> rnd: seeded with 128 bits
> timecounter: Timecounters tick every 10.000 msec
> Kernelized RAIDframe activated
> running cgd selftest aes-xts-256 aes-xts-512 done
> RTC BIOS diagnostic error 0x80
> timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
> Packard Bell EasyNote TS11HR (V1.20)
> mainbus0 (root)
> ACPI: RSDP 0x000FE020 24 (v02 ACRSYS)
> ACPI: XSDT 0x96FFE120 7C (v01 ACRSYS ACRPRDCT 0001
> 0113)
> ACPI: FACP 0x96FFC000 F4 (v04 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: DSDT 0x96FF 008F53 (v01 ACRSYS ACRPRDCT  1025
> 0004)
> ACPI: FACS 0x96F6D000 40
> ACPI: ASF! 0x96FFD000 A5 (v32 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: HPET 0x96FFB000 38 (v01 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: APIC 0x96FFA000 8C (v02 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: MCFG 0x96FF9000 3C (v01 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: SLIC 0x96FEF000 000176 (v01 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: SSDT 0x96FEE000 000BC2 (v01 ACRSYS ACRPRDCT 1000 1025
> 0004)
> ACPI: BOOT 0x96FEC000 28 (v01 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: ASPT 0x96FE9000 34 (v07 ACRSYS ACRPRDCT 0001 1025
> 0004)
> ACPI: SSDT 0x96FE8000 00090C (v01 ACRSYS ACRPRDCT 3000 1025
> 0004)
> ACPI: SSDT 0x96FE7000 000996 (v01 ACRSYS ACRPRDCT 3000 1025
> 0004)
> ACPI: Executed 1 blocks of module-level executable AML code
> ACPI: 4 ACPI AML tables successfully acquired and loaded
> ioapic0 at mainbus0 apid 0: pa 0xfec0, version 0x20, 24 pins
> cpu0 at mainbus0 apid 0
> cpu0: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu0: package 0, core 0, smt 0
> cpu1 at mainbus0 apid 1
> cpu1: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu1: package 0, core 0, smt 1
> cpu2 at mainbus0 apid 2
> cpu2: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu2: package 0, core 1, smt 0
> cpu3 at mainbus0 apid 3
> cpu3: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu3: package 0, core 1, smt 1
> cpu4 at mainbus0 apid 4
> cpu4: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu4: package 0, core 2, smt 0
> cpu5 at mainbus0 apid 5
> cpu5: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu5: package 0, core 2, smt 1
> cpu6 at mainbus0 apid 6
> cpu6: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu6: package 0, core 3, smt 0
> cpu7 at mainbus0 apid 7
> cpu7: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz, id 0x206a7
> cpu7: package 0, core 3, smt 1
> acpi0 at mainbus0: Intel ACPICA 20170831
> acpi0: X/RSDT: OemId , AslId <,0113>
> acpi0: MCFG: segment 0, bus 0-255, address 0xe000
> ACPI: Dynamic OEM Table Load:
> ACPI: SSDT 0xE4010E761810 00067C (v01 PmRef  Cpu0Cst  3001 INTL
> 20100121)
> ACPI: Dynamic OEM Table Load:
> ACPI: SSDT 0xE4025DC6F410 000303 (v01 PmRef  ApIst3000 INTL
> 20100121)
> ACPI: Dynamic OEM Table Load:
> ACPI: SSDT 0xE4010E6915D0 000119 (v01 PmRef  ApCst3000 INTL
> 20100121)
> acpi0: SCI interrupting at int 9
> timecounter: Timecounter "ACPI-Safe" frequency 3579545 Hz quality 900
> hpet0 at acpi0: high precision event timer (mem 0xfed0-0xfed00400)
> timecounter:

Re: New panic in wdc_ata_bio_intr

2017-10-17 Thread Jaromír Doleček

Not at the moment - in the end, I committed the version with the flag.

Thanks for the report and testing!

Jaromir

2017-10-16 21:12 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:

> Well, that was a good one. Running just fine now:
>
> ~ uname -a
> NetBSD nt61p.lorien.lan 8.99.4 NetBSD 8.99.4 (GENERIC) #1: Mon Oct 16
> 20:01:05 BST 2017  
> sysbu...@nt61p.lorien.lan:/home/sysbuild/src/sys/arch/amd64/compile/GENERIC
> amd64
> ~ dmesg | grep wd0
> wd0 at atabus0 drive 0
> wd0: 
> wd0: drive supports 16-sector PIO transfers, LBA48 addressing
> wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448
> sectors
> wd0: 32-bit data port
> wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100),
> NCQ (32 tags)
> wd0(piixide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using
> DMA)
> boot device: wd0
> root on wd0a dumps on wd0b
> ~ atactl wd0 identify
> Model: Hitachi HTS725032A9A364, Rev: PC3OCH0A, Serial #:
> 110320PCKC04BPJ53MLK
> World Wide Name: 5000CCA645DE827F
> Device type: ATA, fixed
> Capacity 320 Gbytes, 625142448 sectors, 512 bytes/sector
> Cylinders: 16383, heads: 16, sec/track: 63
> Command queue depth: 32
> Device capabilities:
> DMA
> LBA
> IORDY operation
> IORDY disabling
> Device supports following standards:
> ATA-2 ATA-3 ATA-4
> ATA-5 ATA-6
> ATA-7 ATA-8
> Command set support:
> NOP command (enabled)
> READ BUFFER command (enabled)
> WRITE BUFFER command (enabled)
> Look-ahead (enabled)
> Write cache (enabled)
> Power Management feature set (enabled)
> Security Mode feature set (disabled)
> SMART feature set (enabled)
> FLUSH CACHE EXT command (enabled)
> FLUSH CACHE command (enabled)
> Device Configuration Overlay feature set (enabled)
> 48-bit Address feature set (enabled)
> Advanced Power Management feature set (enabled)
> DOWNLOAD MICROCODE command (enabled)
> World Wide Name
> General Purpose Logging feature set
> SMART self-test
> SMART error logging
> Serial ATA capabilities:
> 1.5Gb/s signaling
> 3.0Gb/s signaling
> Native Command Queuing
> PHY Event Counters
> Serial ATA features:
> DMA Setup Auto Activate (disabled)
> Device-Initiated Interface Power Managment (disabled)
> Software Settings Preservation (enabled)
>
> Anything else to test?
>
> Chavdar Ivanov
>
> On Mon, 16 Oct 2017 at 19:07 Jaromír Doleček <jaromir.dole...@gmail.com>
> wrote:
>
>> Okay, can you try the following patch? It puts back a flag for IRQ
>> handling. If it works, I might have an idea what's happening. I think there
>> is some rogue interrupt disturbing the state.
>>
>> If it doesn't work, can you please try to compile a kernel with ATADEBUG,
>> and set atadebug_mask (possibly via ddb during boot) to 0x40?
>>
>> Jaromir
>>
>> 2017-10-15 23:10 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:
>>
>>> Sorry, it still crashes the same way. I made sure all was updated before
>>> trying, I do have
>>>
>>> ident /netbsd  | grep wdc
>>>  $NetBSD: atapi_wdc.c,v 1.128 2017/10/10 21:37:49 jdolecek Exp $
>>>  $NetBSD: ata_wdc.c,v 1.108 2017/10/15 11:27:14 jdolecek Exp $
>>>  $NetBSD: wdc_isa.c,v 1.60 2017/10/07 16:05:32 jdolecek Exp $
>>>  $NetBSD: wdc_pcmcia.c,v 1.125 2017/10/07 16:05:33 jdolecek Exp $
>>>  $NetBSD: wdc.c,v 1.285 2017/10/15 18:02:33 jdolecek Exp $
>>>
>>> and the panic is exactly the same.
>>>
>>> I am sure I will sort out my problem on this particular machine if I
>>> swap the internal SSD and the one in the DVD bay, placing the NetBSD root
>>> in the proper place, but nevertheless the panic may indicate some other
>>> unfinished work, so I shall keep it as it is for testing.
>>>
>>> Chavdar Ivanov
>>>
>>> On Sun, 15 Oct 2017 at 19:03 Jaromír Doleček <jaromir.dole...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> should be fixed in rev. 1.285 of dev/ic/wdc.c, can you please check?
>>>>
>>>> Jaromir
>>>>
>>>> 2017-10-14 17:48 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:
>>>>
>>>>> It still panics the same way, no difference.
>>>>>
>>>>> On my other laptop, an HP EliteBook, I haven't the problem at all,
>>>>> only on th

Re: New panic in wdc_ata_bio_intr

2017-10-17 Thread Jaromír Doleček

Okay, can you try the following patch? It puts back a flag for IRQ
handling. If it works, I might have an idea what's happening. I think there
is some rogue interrupt disturbing the state.

If it doesn't work, can you please try to compile a kernel with ATADEBUG, and
set atadebug_mask (possibly via ddb during boot) to 0x40?

Jaromir

2017-10-15 23:10 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:

> Sorry, it still crashes the same way. I made sure all was updated before
> trying, I do have
>
> ident /netbsd  | grep wdc
>  $NetBSD: atapi_wdc.c,v 1.128 2017/10/10 21:37:49 jdolecek Exp $
>  $NetBSD: ata_wdc.c,v 1.108 2017/10/15 11:27:14 jdolecek Exp $
>  $NetBSD: wdc_isa.c,v 1.60 2017/10/07 16:05:32 jdolecek Exp $
>  $NetBSD: wdc_pcmcia.c,v 1.125 2017/10/07 16:05:33 jdolecek Exp $
>  $NetBSD: wdc.c,v 1.285 2017/10/15 18:02:33 jdolecek Exp $
>
> and the panic is exactly the same.
>
> I am sure I will sort out my problem on this particular machine if I swap
> the internal SSD and the one in the DVD bay, placing the NetBSD root in the
> proper place, but nevertheless the panic may indicate some other unfinished
> work, so I shall keep it as it is for testing.
>
> Chavdar Ivanov
>
> On Sun, 15 Oct 2017 at 19:03 Jaromír Doleček <jaromir.dole...@gmail.com>
> wrote:
>
>> Hi,
>>
>> should be fixed in rev. 1.285 of dev/ic/wdc.c, can you please check?
>>
>> Jaromir
>>
>> 2017-10-14 17:48 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:
>>
>>> It still panics the same way, no difference.
>>>
>>> On my other laptop, an HP EliteBook, I haven't the problem at all, only
>>> on the two T61p's (one of them stopped working a week ago, though).
>>>
>>> Chavdar Ivanov
>>>
>>>
>>> On Sat, 14 Oct 2017 at 15:45 Jaromír Doleček <jaromir.dole...@gmail.com>
>>> wrote:
>>>
>>>> Sorry, this fixed patch
>>>>
>>>> 2017-10-14 16:23 GMT+02:00 Jaromír Doleček <jaromir.dole...@gmail.com>:
>>>>
>>>>> Can you try attached patch?
>>>>>
>>>>> Jaromir
>>>>>
>>>>> 2017-10-11 1:04 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:
>>>>>
>>>>>> The timeouts when running under VirtualBox disappeared, but of course
>>>>>> the panic on my T61p remains.
>>>>>>
>>>>>> Chavdar Ivanov
>>>>>>
>>>>>> On Tue, 10 Oct 2017 at 22:40 Jaromír Doleček <
>>>>>> jaromir.dole...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> can you try with dev/scsipi/atapi_wdc.c 1.128? That should resolve
>>>>>>> the timeouts for atapi, at least it did for me.
>>>>>>>
>>>>>>> Jaromir
>>>>>>>
>>>>>>> 2017-10-10 8:08 GMT+02:00 Rares Aioanei <bsdlis...@gmail.com>:
>>>>>>>
>>>>>>>> I get that also on VBox, except it doesn't try to add cd0a as a swap
>>>>>>>> device, nor does it show an endless stream of "lost interrupt"
>>>>>>>> messages; eventually I get a login prompt. This is with yesterday's
>>>>>>>> latest -CURRENT.
>>>>>>>>
>>>>>>>> On Sun, Oct 8, 2017 at 5:17 PM, Chavdar Ivanov <ci4...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > I tried the same kernel on a VirtualBox guest - it doesn't crash,
>>>>>>>> but one
>>>>>>>> > gets endless
>>>>>>>> >
>>>>>>>> > piixide0:1:0: lost interrupt
>>>>>>>> > type: atapi tc_bcount: 0 tc_skip: 0
>>>>>>>> >
>>>>>>>> > stream of messages. Also /etc/rc.d/swap2 start hangs while trying
>>>>>>>> to add
>>>>>>>> > /dev/cd0a as a dump device... as shown by ktruss.
>>>>>>>> >
>>>>>>>> > Weird.
>>>>>>>> >
>>>>>>>> > Chavdar
>>>>>>>> >
>>>>>>>> > On Sun, 8 Oct 2017 at 11:55 Chavdar Ivanov <ci4...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> >>
>>>>>>>> >> System updated about two hours ago. I am getting:
>>>>>>&

Re: New panic in wdc_ata_bio_intr

2017-10-16 Thread Jaromír Doleček

Hi,

should be fixed in rev. 1.285 of dev/ic/wdc.c, can you please check?

Jaromir

2017-10-14 17:48 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:

> It still panics the same way, no difference.
>
> On my other laptop, an HP EliteBook, I haven't the problem at all, only on
> the two T61p's (one of them stopped working a week ago, though).
>
> Chavdar Ivanov
>
>
> On Sat, 14 Oct 2017 at 15:45 Jaromír Doleček <jaromir.dole...@gmail.com>
> wrote:
>
>> Sorry, this fixed patch
>>
>> 2017-10-14 16:23 GMT+02:00 Jaromír Doleček <jaromir.dole...@gmail.com>:
>>
>>> Can you try attached patch?
>>>
>>> Jaromir
>>>
>>> 2017-10-11 1:04 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:
>>>
>>>> The timeouts when running under VirtualBox disappeared, but of course
>>>> the panic on my T61p remains.
>>>>
>>>> Chavdar Ivanov
>>>>
>>>> On Tue, 10 Oct 2017 at 22:40 Jaromír Doleček <jaromir.dole...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> can you try with dev/scsipi/atapi_wdc.c 1.128? That should resolve the
>>>>> timeouts for atapi, at least it did for me.
>>>>>
>>>>> Jaromir
>>>>>
>>>>> 2017-10-10 8:08 GMT+02:00 Rares Aioanei <bsdlis...@gmail.com>:
>>>>>
>>>>>> I get that also on VBox, except it doesn't try to add cd0a as a swap
>>>>>> device, nor does it show an endless stream of "lost interrupt"
>>>>>> messages; eventually I get a login prompt. This is with yesterday's
>>>>>> latest -CURRENT.
>>>>>>
>>>>>> On Sun, Oct 8, 2017 at 5:17 PM, Chavdar Ivanov <ci4...@gmail.com>
>>>>>> wrote:
>>>>>> > I tried the same kernel on a VirtualBox guest - it doesn't crash,
>>>>>> but one
>>>>>> > gets endless
>>>>>> >
>>>>>> > piixide0:1:0: lost interrupt
>>>>>> > type: atapi tc_bcount: 0 tc_skip: 0
>>>>>> >
>>>>>> > stream of messages. Also /etc/rc.d/swap2 start hangs while trying
>>>>>> to add
>>>>>> > /dev/cd0a as a dump device... as shown by ktruss.
>>>>>> >
>>>>>> > Weird.
>>>>>> >
>>>>>> > Chavdar
>>>>>> >
>>>>>> > On Sun, 8 Oct 2017 at 11:55 Chavdar Ivanov <ci4...@gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> System updated about two hours ago. I am getting:
>>>>>> >>
>>>>>> >> 
>>>>>> >> wd0 at atabus0 drive 0
>>>>>> >> wd0: 
>>>>>> >> wd0: drive supports 16-sector PIO transfers, LBA48 addressing
>>>>>> >> wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x
>>>>>> 625142448
>>>>>> >> sectors
>>>>>> >> piixide0:0:0: bad state 0 in wdc_ata_bio_intr
>>>>>> >> panic: wdc_ata_bio_intr: bad state
>>>>>> >> fatal breakpoint trap in supervisor mode
>>>>>> >> trap type 1 code 0 rip 0x8021c0c5 cs 0x8 rflags 0x246 cr2
>>>>>> 0 ilevel
>>>>>> >> 0x8 rsp 0xe40040003c38
>>>>>> >> curlwp 0xe4013bb27840 pid 0.2 lowest kstack 0xe40042c0
>>>>>> >> Stopped at pid 0.2 (system) at netbsd:breakpoint+0x5: leave
>>>>>> >> db{0}> bt
>>>>>> >> breakpoint() at netbsd:breakpoint+0x5
>>>>>> >> vpanic() at netbsd:vpanic+0x140
>>>>>> >> snprintf() at netbsd:snprintf
>>>>>> >> wdc_ata_bio_poll() at netbsd:wdc_ata_bio_poll
>>>>>> >> intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
>>>>>> >> Xintr_ioapic_edge10() at netbsd:Xintr_ioapic_edge10+0xee
>>>>>> >> --- interrupt ---
>>>>>> >> x86_mwait() at netbsd:x86_mwait+0xd
>>>>>> >> acpicpu_cstate_idel_enter() at netbsd:acpicpu_cstate_idle_
>>>>>> enter+0xdb
>>>>>> >> acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xb6
>>>>>> >> idle_loop() at netbsd:idle_loop+0x18c
>>>>>> >> db{0}>
>>>>>> >> 
>>>>>> >>
>>>>>> >> (that is on my usual ThinkPad T61p).
>>>>>> >>
>>>>>> >> Couldn't get a crash dump.
>>>>>> >>
>>>>>> >> Chavdar Ivanov
>>>>>> >>
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>
>>

Re: New panic in wdc_ata_bio_intr

2017-10-16 Thread Jaromír Doleček

Sorry, here is the fixed patch.

2017-10-14 16:23 GMT+02:00 Jaromír Doleček <jaromir.dole...@gmail.com>:

> Can you try attached patch?
>
> Jaromir
>
> 2017-10-11 1:04 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:
>
>> The timeouts when running under VirtualBox disappeared, but of course the
>> panic on my T61p remains.
>>
>> Chavdar Ivanov
>>
>> On Tue, 10 Oct 2017 at 22:40 Jaromír Doleček <jaromir.dole...@gmail.com>
>> wrote:
>>
>>> Hey,
>>>
>>> can you try with dev/scsipi/atapi_wdc.c 1.128? That should resolve the
>>> timeouts for atapi, at least it did for me.
>>>
>>> Jaromir
>>>
>>> 2017-10-10 8:08 GMT+02:00 Rares Aioanei <bsdlis...@gmail.com>:
>>>
>>>> I get that also on VBox, except it doesn't try to add cd0a as a swap
>>>> device, nor does it show an endless stream of "lost interrupt"
>>>> messages; eventually I get a login prompt. This is with yesterday's
>>>> latest -CURRENT.
>>>>
>>>> On Sun, Oct 8, 2017 at 5:17 PM, Chavdar Ivanov <ci4...@gmail.com>
>>>> wrote:
>>>> > I tried the same kernel on a VirtualBox guest - it doesn't crash, but
>>>> one
>>>> > gets endless
>>>> >
>>>> > piixide0:1:0: lost interrupt
>>>> > type: atapi tc_bcount: 0 tc_skip: 0
>>>> >
>>>> > stream of messages. Also /etc/rc.d/swap2 start hangs while trying to
>>>> add
>>>> > /dev/cd0a as a dump device... as shown by ktruss.
>>>> >
>>>> > Weird.
>>>> >
>>>> > Chavdar
>>>> >
>>>> > On Sun, 8 Oct 2017 at 11:55 Chavdar Ivanov <ci4...@gmail.com> wrote:
>>>> >>
>>>> >> System updated about two hours ago. I am getting:
>>>> >>
>>>> >> 
>>>> >> wd0 at atabus0 drive 0
>>>> >> wd0: 
>>>> >> wd0: drive supports 16-sector PIO transfers, LBA48 addressing
>>>> >> wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448
>>>> >> sectors
>>>> >> piixide0:0:0: bad state 0 in wdc_ata_bio_intr
>>>> >> panic: wdc_ata_bio_intr: bad state
>>>> >> fatal breakpoint trap in supervisor mode
>>>> >> trap type 1 code 0 rip 0x8021c0c5 cs 0x8 rflags 0x246 cr2 0
>>>> ilevel
>>>> >> 0x8 rsp 0xe40040003c38
>>>> >> curlwp 0xe4013bb27840 pid 0.2 lowest kstack 0xe40042c0
>>>> >> Stopped at pid 0.2 (system) at netbsd:breakpoint+0x5: leave
>>>> >> db{0}> bt
>>>> >> breakpoint() at netbsd:breakpoint+0x5
>>>> >> vpanic() at netbsd:vpanic+0x140
>>>> >> snprintf() at netbsd:snprintf
>>>> >> wdc_ata_bio_poll() at netbsd:wdc_ata_bio_poll
>>>> >> intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
>>>> >> Xintr_ioapic_edge10() at netbsd:Xintr_ioapic_edge10+0xee
>>>> >> --- interrupt ---
>>>> >> x86_mwait() at netbsd:x86_mwait+0xd
>>>> >> acpicpu_cstate_idel_enter() at netbsd:acpicpu_cstate_idle_enter+0xdb
>>>> >> acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xb6
>>>> >> idle_loop() at netbsd:idle_loop+0x18c
>>>> >> db{0}>
>>>> >> 
>>>> >>
>>>> >> (that is on my usual ThinkPad T61p).
>>>> >>
>>>> >> Couldn't get a crash dump.
>>>> >>
>>>> >> Chavdar Ivanov
>>>> >>
>>>> >
>>>>
>>>
>>>
>
Index: ata_wdc.c
===
RCS file: /cvsroot/src/sys/dev/ata/ata_wdc.c,v
retrieving revision 1.107
diff -u -p -r1.107 ata_wdc.c
--- ata_wdc.c   8 Oct 2017 13:35:03 -   1.107
+++ ata_wdc.c   14 Oct 2017 14:45:17 -
@@ -618,14 +623,13 @@ wdc_ata_bio_poll(struct ata_channel *chp
 }
 
 static int
-wdc_ata_bio_intr(struct ata_channel *chp, struct ata_xfer *xfer, int is)
+wdc_ata_bio_intr(struct ata_channel *chp, struct ata_xfer *xfer, int irq)
 {
 	struct atac_softc *atac = chp->ch_atac;
 	struct wdc_softc *wdc = CHAN_TO_WDC(chp);
 	struct ata_bio *ata_bio = &xfer->c_bio;
 	struct ata_drive_datas *drvp = &chp->ch_drive[xfer->c_drive];
 	int drv_err, tfd;
-	bool poll = ((xfer->c_flags & C_POLL) != 0);
 
 	ATADEBUG_PRINT(("wdc_ata_bio_intr %s:%d:%d\n",
 	    device_xname(atac->atac_dev), chp->ch_channel, xfer->c_drive),
@@ -659,8 +663,9 @@ wdc_ata_bio_intr(struct ata_channel *chp
 #endif
 
 	/* Ack interrupt done by wdc_wait_for_unbusy */
-	if (wdc_wait_for_unbusy(chp, poll ? ATA_DELAY : 0, AT_POLL, &tfd) < 0) {
-		if (!poll && (xfer->c_flags & C_TIMEOU) == 0) {
+	if (wdc_wait_for_unbusy(chp,
+	    (irq == 0) ? ATA_DELAY : 0, AT_POLL, &tfd) < 0) {
+		if (irq && (xfer->c_flags & C_TIMEOU) == 0) {
 			ata_channel_unlock(chp);
 			return 0; /* IRQ was not for us */
 		}
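
The decision made in the second hunk can be modelled in isolation. The
following is only a sketch under assumptions: the two helper functions are
hypothetical and the constant values are placeholders, not the kernel's
definitions. It shows how the irq argument takes over the role of the removed
poll flag: no busy-wait when called from the interrupt handler, and a
"not ours" return only in that case.

```c
#include <assert.h>
#include <stdbool.h>

#define ATA_DELAY	10000	/* placeholder timeout value */
#define C_TIMEOU	0x0002	/* placeholder flag value */

/* Timeout handed to the busy-wait: wait only when not called from an IRQ. */
static int
unbusy_timeout(int irq)
{
	return (irq == 0) ? ATA_DELAY : 0;
}

/*
 * If the device is still busy, treat the call as "IRQ was not for us"
 * only when it really came from an interrupt and the transfer has not
 * already been flagged as timed out.
 */
static bool
irq_not_for_us(int irq, int c_flags)
{
	return irq && (c_flags & C_TIMEOU) == 0;
}
```

This mirrors the removed !poll test; a polled call (irq == 0) waits up to the
timeout and never takes the early return.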

Re: New panic in wdc_ata_bio_intr

2017-10-16 Thread Jaromír Doleček

Can you try attached patch?

Jaromir

2017-10-11 1:04 GMT+02:00 Chavdar Ivanov <ci4...@gmail.com>:

> The timeouts when running under VirtualBox disappeared, but of course the
> panic on my T61p remains.
>
> Chavdar Ivanov
>
> On Tue, 10 Oct 2017 at 22:40 Jaromír Doleček <jaromir.dole...@gmail.com>
> wrote:
>
>> Hey,
>>
>> can you try with dev/scsipi/atapi_wdc.c 1.128? That should resolve the
>> timeouts for atapi, at least it did for me.
>>
>> Jaromir
>>
>> 2017-10-10 8:08 GMT+02:00 Rares Aioanei <bsdlis...@gmail.com>:
>>
>>> I get that also on VBox, except it doesn't try to add cd0a as a swap
>>> device, nor does it show an endless stream of "lost interrupt"
>>> messages; eventually I get a login prompt. This is with yesterday's
>>> latest -CURRENT.
>>>
>>> On Sun, Oct 8, 2017 at 5:17 PM, Chavdar Ivanov <ci4...@gmail.com> wrote:
>>> > I tried the same kernel on a VirtualBox guest - it doesn't crash, but
>>> one
>>> > gets endless
>>> >
>>> > piixide0:1:0: lost interrupt
>>> > type: atapi tc_bcount: 0 tc_skip: 0
>>> >
>>> > stream of messages. Also /etc/rc.d/swap2 start hangs while trying to
>>> add
>>> > /dev/cd0a as a dump device... as shown by ktruss.
>>> >
>>> > Weird.
>>> >
>>> > Chavdar
>>> >
>>> > On Sun, 8 Oct 2017 at 11:55 Chavdar Ivanov <ci4...@gmail.com> wrote:
>>> >>
>>> >> System updated about two hours ago. I am getting:
>>> >>
>>> >> 
>>> >> wd0 at atabus0 drive 0
>>> >> wd0: 
>>> >> wd0: drive supports 16-sector PIO transfers, LBA48 addressing
>>> >> wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448
>>> >> sectors
>>> >> piixide0:0:0: bad state 0 in wdc_ata_bio_intr
>>> >> panic: wdc_ata_bio_intr: bad state
>>> >> fatal breakpoint trap in supervisor mode
>>> >> trap type 1 code 0 rip 0x8021c0c5 cs 0x8 rflags 0x246 cr2 0
>>> ilevel
>>> >> 0x8 rsp 0xe40040003c38
>>> >> curlwp 0xe4013bb27840 pid 0.2 lowest kstack 0xe40042c0
>>> >> Stopped at pid 0.2 (system) at netbsd:breakpoint+0x5: leave
>>> >> db{0}> bt
>>> >> breakpoint() at netbsd:breakpoint+0x5
>>> >> vpanic() at netbsd:vpanic+0x140
>>> >> snprintf() at netbsd:snprintf
>>> >> wdc_ata_bio_poll() at netbsd:wdc_ata_bio_poll
>>> >> intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
>>> >> Xintr_ioapic_edge10() at netbsd:Xintr_ioapic_edge10+0xee
>>> >> --- interrupt ---
>>> >> x86_mwait() at netbsd:x86_mwait+0xd
>>> >> acpicpu_cstate_idel_enter() at netbsd:acpicpu_cstate_idle_enter+0xdb
>>> >> acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xb6
>>> >> idle_loop() at netbsd:idle_loop+0x18c
>>> >> db{0}>
>>> >> 
>>> >>
>>> >> (that is on my usual ThinkPad T61p).
>>> >>
>>> >> Couldn't get a crash dump.
>>> >>
>>> >> Chavdar Ivanov
>>> >>
>>> >
>>>
>>
>>
Index: ata_wdc.c
===
RCS file: /cvsroot/src/sys/dev/ata/ata_wdc.c,v
retrieving revision 1.107
diff -u -p -r1.107 ata_wdc.c
--- ata_wdc.c   8 Oct 2017 13:35:03 -   1.107
+++ ata_wdc.c   14 Oct 2017 14:22:20 -
@@ -618,14 +623,13 @@ wdc_ata_bio_poll(struct ata_channel *chp
 }
 
 static int
-wdc_ata_bio_intr(struct ata_channel *chp, struct ata_xfer *xfer, int is)
+wdc_ata_bio_intr(struct ata_channel *chp, struct ata_xfer *xfer, int irq)
 {
 	struct atac_softc *atac = chp->ch_atac;
 	struct wdc_softc *wdc = CHAN_TO_WDC(chp);
 	struct ata_bio *ata_bio = &xfer->c_bio;
 	struct ata_drive_datas *drvp = &chp->ch_drive[xfer->c_drive];
 	int drv_err, tfd;
-	bool poll = ((xfer->c_flags & C_POLL) != 0);
 
 	ATADEBUG_PRINT(("wdc_ata_bio_intr %s:%d:%d\n",
 	    device_xname(atac->atac_dev), chp->ch_channel, xfer->c_drive),
@@ -659,8 +663,9 @@ wdc_ata_bio_intr(struct ata_channel *chp
 #endif
 
 	/* Ack interrupt done by wdc_wait_for_unbusy */
-	if (wdc_wait_for_unbusy(chp, poll ? ATA_DELAY : 0, AT_POLL, &tfd) < 0) {
-		if (!poll && (xfer->c_flags & C_TIMEOU) == 0) {
+	if (wdc_wait_for_unbusy(chp,
+	    (irq == 0) ? ATA_DELAY : 0, AT_POLL, &tfd) < 0) {
+		if (!irq && (xfer->c_flags & C_TIMEOU) == 0) {
 			ata_channel_unlock(chp);
 			return 0; /* IRQ was not for us */
 		}

Re: HEADS-UP: SATA NCQ support merged (from jdolecek-ncq branch)

2017-10-10 Thread Jaromír Doleček

I've fixed the compilation for ALL kernels.

2017-10-10 17:34 GMT+02:00 Michael :
> I tried sequential reads ( dd if=/dev/rwd0c ... ) and throughput took a
> significant hit. I used to get about 120MB/s with the siisata, now it
> fluctuates between 80 and 90MB/s, ahcisata dropped from about 80MB/s to
> 70MB/s. Both spinning rust of varying vintage.
> I should probably do a bonnie run on either one before & after to see
> if there's any change in random access.

I've seen this on one of my disks, too. It seems it's much slower in NCQ
mode. I think the firmware might not utilise the disk cache properly when
in NCQ mode.

You can try switching it off via sysctl, hw.wdX.use_ncq. You can also try
to turn off use_ncq_prio if that makes any difference.

We might need to introduce some heuristics for this.

Jaromir

Re: New panic in wdc_ata_bio_intr

2017-10-10 Thread Jaromír Doleček

Hey,

can you try with dev/scsipi/atapi_wdc.c 1.128? That should resolve the
timeouts for atapi, at least it did for me.

Jaromir

2017-10-10 8:08 GMT+02:00 Rares Aioanei :

> I get that also on VBox, except it doesn't try to add cd0a as a swap
> device, nor does it show an endless stream of "lost interrupt"
> messages; eventually I get a login prompt. This is with yesterday's
> latest -CURRENT.
>
> On Sun, Oct 8, 2017 at 5:17 PM, Chavdar Ivanov  wrote:
> > I tried the same kernel on a VirtualBox guest - it doesn't crash, but one
> > gets endless
> >
> > piixide0:1:0: lost interrupt
> > type: atapi tc_bcount: 0 tc_skip: 0
> >
> > stream of messages. Also /etc/rc.d/swap2 start hangs while trying to add
> > /dev/cd0a as a dump device... as shown by ktruss.
> >
> > Weird.
> >
> > Chavdar
> >
> > On Sun, 8 Oct 2017 at 11:55 Chavdar Ivanov  wrote:
> >>
> >> System updated about two hours ago. I am getting:
> >>
> >> 
> >> wd0 at atabus0 drive 0
> >> wd0: 
> >> wd0: drive supports 16-sector PIO transfers, LBA48 addressing
> >> wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448
> >> sectors
> >> piixide0:0:0: bad state 0 in wdc_ata_bio_intr
> >> panic: wdc_ata_bio_intr: bad state
> >> fatal breakpoint trap in supervisor mode
> >> trap type 1 code 0 rip 0x8021c0c5 cs 0x8 rflags 0x246 cr2 0
> ilevel
> >> 0x8 rsp 0xe40040003c38
> >> curlwp 0xe4013bb27840 pid 0.2 lowest kstack 0xe40042c0
> >> Stopped at pid 0.2 (system) at netbsd:breakpoint+0x5: leave
> >> db{0}> bt
> >> breakpoint() at netbsd:breakpoint+0x5
> >> vpanic() at netbsd:vpanic+0x140
> >> snprintf() at netbsd:snprintf
> >> wdc_ata_bio_poll() at netbsd:wdc_ata_bio_poll
> >> intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
> >> Xintr_ioapic_edge10() at netbsd:Xintr_ioapic_edge10+0xee
> >> --- interrupt ---
> >> x86_mwait() at netbsd:x86_mwait+0xd
> >> acpicpu_cstate_idel_enter() at netbsd:acpicpu_cstate_idle_enter+0xdb
> >> acpicpu_cstate_idle() at netbsd:acpicpu_cstate_idle+0xb6
> >> idle_loop() at netbsd:idle_loop+0x18c
> >> db{0}>
> >> 
> >>
> >> (that is on my usual ThinkPad T61p).
> >>
> >> Couldn't get a crash dump.
> >>
> >> Chavdar Ivanov
> >>
> >
>

HEADS-UP: SATA NCQ support merged (from jdolecek-ncq branch)

2017-10-09 Thread Jaromír Doleček

Hi,

I've merged the NCQ branch to HEAD.

NCQ is supported on ahcisata(4), siisata(4), and mvsata(4) Gen IIe at this
moment.

The code was quite extensively tested on that hardware on amd64. Other archs
and drivers compile, but I had no way to test them. In particular, I had no
chance to really test any real IDE disks, nor any non-PCI controller
attachments. If you are in a position to confirm them working, I'd appreciate
it.

Please report any problems, I'll fix any potential fallout ASAP.

Performance-wise, I've so far tested only sequential I/O via fio and dd
with cache enabled. I observed a very mild (>2%, 72.7->74.8 MB/s) performance
increase for a spinning rust HDD, but quite a big increase for an SSD (350->450
MB/s). With the disk cache disabled, or with random I/O, NCQ will probably have
more effect.
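
As a rough cross-check of the two figures above, here is a minimal C sketch that turns the before/after throughput into a relative gain. The helper names percent_gain() and print_ncq_gains() are made up for illustration and are not part of any NetBSD tool; only the four MB/s values come from the measurements above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): relative throughput change in
 * percent when going from `before` to `after`, both in MB/s.
 */
static double
percent_gain(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print the gain for the two sequential-I/O data points quoted above. */
static void
print_ncq_gains(void)
{
	printf("HDD: %.1f%%\n", percent_gain(72.7, 74.8));
	printf("SSD: %.1f%%\n", percent_gain(350.0, 450.0));
}
```

Calling print_ncq_gains() from a main() gives roughly 2.9% for the HDD and 28.6% for the SSD, which matches the "mild" versus "quite big" description.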

Jaromir

Re: AMD Ryzen and NetBSD?

2017-07-03 Thread Jaromír Doleček

Fancy trying if it would behave differently with the NCQ branch?

Jaromir

2017-07-03 6:34 GMT+02:00 Thor Lancelot Simon :
>
> On Sun, Jul 02, 2017 at 10:57:20PM +0100, Patrick Welche wrote:
> > On Fri, Jun 30, 2017 at 12:00:45PM -0400, Thor Lancelot Simon wrote:
> >
> > I shoved a rather newer ST2000DM001-1CH164 in, which according to its
> > marketing bumpf can manage "Max SustainableTransfer Rate 210MB/s"
> > and not so bad:
> >
> > # dd if=/dev/zero ibs=64k | progress -l 976751887b dd of=/dev/rdk15 obs=64k
> >  99% |** |   465 GiB  116.74 MiB/s  00:00 ETA
>
> This is already effectively double buffered, because of the way you used
> "progress".  You could try using a larger blocksize for the reads from
> /dev/zero (1m perhaps) and also for the writes to rdk15 - the kernel
> will buffer up and dispatch the MAXPHYS sized I/Os.
>
> To get 200MB out of that drive you likely need larger writes, which we
> currently can't do.  It might perform slightly better through the
> filesystem, though.
>
> Thor

HEADS-UP: jdolecek-ncq branch merge imminent

2017-06-07 Thread Jaromír Doleček

Hello,

I plan to merge the branch to HEAD very soon, likely over the weekend.
Any further fixes will be done directly on HEAD, including mvsata(4)
restabilization, and a potential switch of siisata(4) to support NCQ.

The plan is to also get this pulled up to the netbsd-8 branch soon, so that it
will be part of 8.0.

Status:
- ahci(4) fully working with NCQ (confirmed with qemu, and real hw)
- piixide(4) continues working (no NCQ support of course) (confirmed in
qemu)
- siisata(4) continues working (without NCQ still) (confirmed with real hw)
- mvsata(4) not yet confirmed working after changes, mainly due to DMA not
really working on the Marvell 88SX6042 which I have available - I have the same
issue as kern/52126
- other ide/sata drivers received mechanical changes, should continue
working as before

Jaromir

wd* at umass? owners wanted for testing

2017-04-21 Thread Jaromír Doleček

Hi,

as part of work on jdolecek-ncq branch, I've made some mechanical
changes to adjust the umass_isdata.c driver code to still hopefully
work. I don't have the hardware though, so I'd like to have a real
confirmation from somebody who does, before the branch would be
merged.

If you have the hardware which is attached as wd* at umass?, can you
please try to boot the kernel from the branch, check if it actually
works and let me know the results?

Thank you.

Jaromir

Re: panic: kernel diagnostic assertion "next != _PSLIST_POISON"

2017-03-14 Thread Jaromír Doleček

Yes, this panic is already fixed in -current:

panic: kernel diagnostic assertion "!(bp->b_oflags & BO_DELWRI)"
failed: file "../../../../kern/vfs_wapbl.c", line 1142

Jaromir

2017-03-14 9:04 GMT+01:00 Frank Kardel :
> Hmm, I think ch_voltag_convert_in() is a red herring,
>
> Both panics contextually match the higher parts of the stack traces. So I
> would disregard the ch_voltag_convert_in() part here and
> conclude it is two distinct panics. One relates to psref corruption in
> network code and the other to wapbl and possibly
> recent mount update (-u) changes,
>
> Other ideas ?
>
> Frank
>
>
> On 03/14/17 08:56, Masanobu SAITOH wrote:
>>
>> Hi.
>>
>> On 2017/03/14 16:36, Frank Kardel wrote:
>>>
>>> Has anyone seen this panic recently?
>>>
>>> Seen in -current-20170311, i386, Soekris 6501.
>>>
>>> panic: kernel diagnostic assertion "next != _PSLIST_POISON" failed: file
>>> "/fs/raid2a/src/NetBSD/cur/src/sys/sys/pslist.h", line 270
>>> cpu0: Begin traceback...
>>>
>>> vpanic(c0cb1784,dba43dac,dba43e2c,c09e0d1e,c0cb1784,c0cb16d3,c0cb681b,c0cb6458,10e,a8)
>>> at netbsd:vpanic+0x121
>>>
>>> ch_voltag_convert_in(c0cb1784,c0cb16d3,c0cb681b,c0cb6458,10e,a8,0,c3d70578,c09e0988,c3d70348)
>>> at netbsd:ch_voltag_convert_in
>>>
>>> sysctl_iflist(4,cbd8cf60,c7,cbd8cff9,c33c06c0,c7,c090f986,0,cbd8cf60,a43e90)
>>> at c09e0d1e
>>>
>>> sysctl_rtable(dba43f0c,3,afe01000,dba43efc,0,0,dba43f00,c3de1560,c3c11c0c,3)
>>> at c09e129c
>>>
>>> sysctl_dispatch(dba43f00,6,afe01000,dba43efc,0,0,dba43f00,c3de1560,c3c11c0c,dba43efc)
>>> at netbsd:sysctl_dispatch+0xbd
>>>
>>> sys___sysctl(c3de1560,dba43f68,dba43f60,7dd51000,c3de1560,dba43f60,dba43f68,0,0,b0094fb0)
>>> at netbsd:sys___sysctl+0xe3
>>> syscall() at netbsd:syscall+0x257
>>> --- syscall (number 202) ---
>>> b00736f7:
>>> cpu0: End traceback...
>>>
>>> Frank
>>
>>
>> Yesterday I sent the following mail to current-users@ but it haven't
>> delivered yet...
>>
>>>  I updated my machine's kernel which was made from 1 hour ago's
>>> -current source. It paniced. It's reproducible.
>>>
>>>> /dev/rwd0a: file system is clean; not checking
>>>> panic: kernel diagnostic assertion "!(bp->b_oflags & BO_DELWRI)" failed:
>>>> file "../../../../kern/vfs_wapbl.c", line 1142
>>>> fatal breakpoint trap in supervisor mode
>>>> trap type 1 code 0 rip 0x80215455 cs 0x8 rflags 0x246 cr2
>>>> 0x770e1f2ae190 ilevel 0 rsp 0xfe8120956b00
>>>> curlwp 0xfe847b8820a0 pid 30.1 lowest kstack 0xfe81209532c0
>>>> Stopped in pid 30.1 (mount_ffs) at netbsd:breakpoint+0x5:  leave
>>>> db{15}> trace
>>>> breakpoint() at netbsd:breakpoint+0x5
>>>> vpanic() at netbsd:vpanic+0x140
>>>> ch_voltag_convert_in() at netbsd:ch_voltag_convert_in
>>>> wapbl_add_buf() at netbsd:wapbl_add_buf+0x133
>>>> bdwrite() at netbsd:bdwrite+0xbd
>>>> bwrite() at netbsd:bwrite+0x95
>>>> ffs_sbupdate() at netbsd:ffs_sbupdate+0x1b9
>>>> ffs_wapbl_start() at netbsd:ffs_wapbl_start+0x177
>>>> ffs_mount() at netbsd:ffs_mount+0x4e9
>>>> VFS_MOUNT() at netbsd:VFS_MOUNT+0x34
>>>> do_sys_mount() at netbsd:do_sys_mount+0x5ee
>>>> sys___mount50() at netbsd:sys___mount50+0x33
>>>> syscall() at netbsd:syscall+0x1ed
>>>> --- syscall (number 410) ---
>>>> 770e1f28989a:
>>>> db{15}>
>>>
>>>
>>>  At least five days ago's kernel worked without this proble,
>>
>>
>> Both panics include ch_voltag_convert_in()
>>
>

Re: W^X mmap

2016-12-26 Thread Jaromír Doleček

I think you can avoid the #ifdef in uvm_mmap.c by simply defining
the macro PAX_MPROTECT_ADJUST() to return 0 if the feature is off.

Also, it would be wiser to just add error handling to the call in
uvm_unix.c, rather than assuming it never fails. Or just remove the
call there if it's so redundant - isn't the same check and adjustment
actually done by uvm_mmap() later?

I also see that after your change, pmap_mprotect_adjust() doesn't
adjust the flags on the error path any more - is that wise, what happens
if the caller ignores the EACCES? I guess it would be safer to ensure the
flags are sanitized regardless.

Oh, and yes, I think EACCES is nicer than EOPNOTSUPP, as the latter
is more usually used for unsupported system calls rather than
individual flags. I think actually EINVAL is more appropriate than
either of those two, however - EACCES seems to be more concerned with
protection flags versus the descriptor mode rather than invalid flags.
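
To make the first and third suggestions concrete, here is a small userland sketch in C. It is not the kernel code: SKETCH_PAX_MPROTECT, SKETCH_MPROTECT_ADJUST(), sketch_mprotect_adjust() and sketch_map_check() are all hypothetical names, and EACCES is used only because that is what the patch under review returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical switch; set to 0 to model a build without the feature. */
#define SKETCH_PAX_MPROTECT	1

#if SKETCH_PAX_MPROTECT
/*
 * Reject requests that are both writable and executable.  PROT_EXEC is
 * stripped even on the error path, so a caller that ignores the return
 * value still ends up with a W^X-safe protection mask.
 */
static int
sketch_mprotect_adjust(int *prot)
{
	assert(prot != NULL);
	if ((*prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC)) {
		*prot &= ~PROT_EXEC;
		return EACCES;
	}
	return 0;
}
#define SKETCH_MPROTECT_ADJUST(prot)	sketch_mprotect_adjust(prot)
#else
/* Feature compiled out: the macro evaluates to 0 (success). */
#define SKETCH_MPROTECT_ADJUST(prot)	((void)(prot), 0)
#endif

/* A caller in the style of an mmap path: one code path, no #ifdef. */
static int
sketch_map_check(int prot, int *effective)
{
	int error = SKETCH_MPROTECT_ADJUST(&prot);

	*effective = prot;
	return error;
}
```

With the switch on, a PROT_READ | PROT_WRITE | PROT_EXEC request yields EACCES and an effective mask without PROT_EXEC; with it off, the same caller compiles unchanged and always sees 0.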

Jaromir

2016-12-26 20:11 GMT+01:00 Pierre Pronchery :
> Hi,
>
> I have simplified the patch, changed it to return EACCES upon errors,
> adapted it to -current, and tested it there (both with PAX_MPROTECT set and
> not set). It is still not 100% elegant though (adds an #ifdef) so I will
> welcome ideas on how to improve it some more.
>
> Cheers,
> -- khorben
>
>
> On 26/12/2016 00:10, Pierre Pronchery wrote:
>>
>> On 10/12/2016 14:02, Michael van Elst wrote:
>>>
>>> co...@sdf.org writes:
>>>
>>>> Why doesn't the following code get rejected by pax mprotect?
>>>
>>>
>>>> a = mmap(NULL, BUFSIZ, PROT_READ | PROT_WRITE | PROT_EXEC,
>>>> MAP_ANON, -1, 2);
>>>
>>>
>>> It gets 'rejected' by silently dropping the PROT_EXEC flag.
>>
>>
>> I find this awful: programs trying to use e.g JIT will fail to detect
>> that it really is not supported, and crash later instead.
>>
>> I am attaching here a patch returning errors instead.
>>
>> Thanks to this patch, www/firefox works without having to set the "m:
>> mprotect(2) restrictions, explicit disable" flag on its executable
>> binaries (tested on netbsd-7/amd64).
>>
>>> POSIX would require mmap to fail with errno = EACCES.
>>
>>
>> In the patch attached I have used ENOTSUP, because this is what OpenBSD
>> seems to be using:
>> http://man.openbsd.org/mmap.2
>>
>> I also think EACCES (or EPERM?) would be better though, so I will be
>> happy to replace it if considered more appropriate.
>>
>> I have changed the logic deciding which flags to drop. It used to be,
>> independently of whether PROT_READ is set:
>> - if PROT_WRITE, or PROT_WRITE and PROT_EXECUTE are set, then execution
>>   is silently denied;
>> - otherwise, writing is silently denied.
>> (which doesn't make much sense to me)
>>
>> Now there would be only one case instead:
>> - if PROT_WRITE and PROT_EXECUTE are set, execution is denied and an
>>   error is returned.
>>
>> Another thing I will really need to know before committing this, is
>> whether the changes should really be applied to sys_mmap() only.
>> Finally, I left a XXX where there might be a side-effect, if applied in
>> sys/kern/exec_subr.c, after calling vn_rdwr().
>>
>> Cheers,
>
>
> --
> khorben

Re: ffs_newvnode: inode has non zero blocks

2016-11-08 Thread Jaromír Doleček

> | There are some further changes needed to cover a possible dup alloc ,
> | and to keep the !wapbl case recoverable by fsck. There is ongoing
> | discussion on source-changes about that, hope we finalise fix later in
> | the week.
>
> Leaving a filesystem problem committed on head that can cause filesystem
> corruption for a week is not considerate to people who use current.

There shouldn't be anything causing filesystem corruption any more. I
will resolve this soon.

Jaromir

Re: ffs_newvnode: inode has non zero blocks

2016-11-08 Thread Jaromír Doleček

Yes, that problem is related to the wapbl change. I've committed a bug
fix, so a newer kernel shouldn't trigger the panic any more.

There are some further changes needed to cover a possible dup alloc,
and to keep the !wapbl case recoverable by fsck. There is an ongoing
discussion on source-changes about that; I hope we can finalise the fix
later in the week.

Jaromir

2016-11-07 23:07 GMT+01:00 Andreas Gustafsson :
> Earlier, I wrote:
>> Also, I have now narrowed down the appearance of the problem on the
>> testbed to the following commit:
>>
>>   2016.10.30.15.01.46 christos src/sys/ufs/ffs/ffs_alloc.c 1.154
>>
>> The mystery remains because the commit message says there should be no
>> functional change, and I also did a quick review of the diff and did
>> not spot anything that could be a cause of the panic.
>
> I have now also run a bisection of this on my own testbed, and it
> identified a different commits than the TNF testbed did:
>
>   2016.10.28.20.38.12 jdolecek src/sys/kern/vfs_wapbl.c 1.85
>   2016.10.28.20.38.12 jdolecek src/sys/sys/wapbl.h 1.19
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ffs/ffs_alloc.c 1.153
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ffs/ffs_inode.c 1.118
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ffs/ffs_snapshot.c 1.143
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ufs/ufs_extern.h 1.83
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ufs/ufs_inode.c 1.97
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ufs/ufs_rename.c 1.13
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ufs/ufs_vnops.c 1.233
>   2016.10.28.20.38.12 jdolecek src/sys/ufs/ufs/ufs_wapbl.h 1.12
>
> More data at:
>
>   
> http://releng.netbsd.org/b5reports/sparc/commits-2016.10.html#2016.10.30.15.01.46
>   
> http://www.gson.org/netbsd/bugs/build/sparc/commits-2016.10.html#2016.10.28.20.38.12
>
> --
> Andreas Gustafsson, g...@gson.org

Re: ffs_newvnode: inode has non zero blocks

2016-11-02 Thread Jaromír Doleček

There was recently a change in FFS in the general area for
WAPBL. Can you try the attached patch and check if the following KASSERT()
triggers?

2016-11-02 18:39 GMT+01:00 Andreas Gustafsson :
> co...@sdf.org wrote:
>> I'm pretty 'abusive' to my machine. unsurprisingly, I've managed to 
>> accumulate a problem:
>>
>>   ffs_newvnode: ino=20681997 on /: gen 5ae8a721/5ae8a721 has non zero blocks 
>> 980 or size 0
>>   panic: ffs_newvnode: dirty filesystem?
>
> The TNF sparc testbed recently started panicing with a similar error in
> every test run:
>
>   sbin/resize_ffs/t_grow_swapped (445/663): 4 test cases
>   grow_16M_v0_65536: ffs_newvnode: ino=45826 on /: gen 65327e67/65327e67 
> has non zero blocks 180 or size 0
>   panic: ffs_newvnode: dirty filesystem?
>   cpu0: Begin traceback...
>   0x0(0xf04010b8, 0xf4538a50, 0xf04a3800, 0xf04a4400, 0xf04a45c0, 0x104) at 
> netbsd:panic+0x20
>   panic(0xf04010b8, 0xf03c39a0, 0x0, 0xb302, 0xf07578d4, 0xf047e000) at 
> netbsd:ffs_newvnode+0x444
>   ffs_newvnode(0xf073, 0xf0970328, 0x81a4, 0xf4538cb0, 0xf069cb28, 
> 0xf0984810) at netbsd:vcache_new+0x5c
>   vcache_new(0xf073, 0xf0970328, 0xf4538cb0, 0xf069cb28, 0xf4538b74, 0x0) 
> at netbsd:ufs_makeinode+0x14
>   ufs_makeinode(0xf4538cb0, 0xf0970328, 0xf096ef2c, 0xf4538dcc, 0xf4538de0, 
> 0xf0926460) at netbsd:ufs_create+0x30
>   ufs_create(0xf4538c3c, 0xfff8, 0x0, 0x0, 0xf096ef2c, 0xf0970328) at 
> netbsd:VOP_CREATE+0x28
>   VOP_CREATE(0xf0970328, 0xf4538dcc, 0xf4538de0, 0xf4538cb0, 0xf0002000, 
> 0xf0785150) at netbsd:vn_open+0x24c
>   vn_open(0x0, 0x602, 0x1a4, 0xf069cb28, 0xf0851000, 0xf4538db8) at 
> netbsd:do_open+0x90
>   do_open(0x0, 0x0, 0xf0785150, 0x602, 0x1a4, 0xf4538ec4) at 
> netbsd:do_sys_openat+0x60
>   do_sys_openat(0xf0aa05a0, 0xff9c, 0xeda08080, 0x601, 0x1a4, 0xf4538ec4) 
> at netbsd:sys_open+0x18
>   sys_open(0xf0aa05a0, 0xf4538f30, 0xf4538f28, 0xeda08080, 0x0, 0x169b04f) at 
> netbsd:syscall+0x248
>   syscall(0xc05, 0xf4538fb0, 0xedc06b58, 0x5, 0x4e, 0xf0aa05a0) at 
> netbsd:memfault_sun4m+0x3f4
>   cpu0: End traceback...
>
> More logs at:
>
>   
> http://releng.netbsd.org/b5reports/sparc/commits-2016.10.html#2016.10.30.19.33.49
>
> The strange thing is that this problem seems to have started soon
> after your report, not before it as I would expect if it were also the
> cause of your crash.  The filesystems involved are all newly created
> in each test run.
> --
> Andreas Gustafsson, g...@gson.org
Index: ffs_inode.c
===
RCS file: /cvsroot/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.118
diff -u -r1.118 ffs_inode.c
--- ffs_inode.c 28 Oct 2016 20:38:12 -  1.118
+++ ffs_inode.c 2 Nov 2016 21:15:11 -
@@ -543,6 +543,7 @@
oip->i_size = length;
DIP_ASSIGN(oip, size, length);
DIP_ADD(oip, blocks, -blocksreleased);
+   KASSERT((DIP(oip, size) == 0) == (DIP(oip, blocks) == 0));
genfs_node_unlock(ovp);
oip->i_flag |= IN_CHANGE;
UFS_WAPBL_UPDATE(ovp, NULL, NULL, 0);

Re: Wapbl correct and stable again?

2016-10-21 Thread Jaromír Doleček

2016-10-21 1:56 GMT+02:00 bch :
> I just had a kernel fault (might be audio subsystem, will investigate), but
> with this latest (7.99.40) kernel I'm still getting corruption. I don't know
> if it's new code in the filesystem, or bad luck that I'm faulting so much,
> exposing something that's already been there for a while, but this seems
> pretty unstable.

There weren't really any significant changes to wapbl or ffs so far.
Just some slight refactoring, to prepare code for first real fixes.

Within kernel, there is some active development with wm(4) and network
subsystem, maybe that could be related to your problem. Those however
shouldn't really matter for your local build.sh builds.

If you want, you can try reverting all the latest wapbl changes in
your local checkout, and see if it improves the situation for you:
sys/kern/vfs_wapbl.c to rev. 1.78
sys/sys/wapbl.h to rev. 1.17
sys/ufs/ffs/ffs_wapbl.c to rev. 1.32
sys/ufs/ffs/ffs_extern.h to rev. 1.82
sys/ufs/ffs/ffs_alloc.c to rev. 1.151

> On Oct 20, 2016 4:10 PM, "Thor Lancelot Simon"  wrote:
>>
>> Could the discards be entered into the log?

That should eventually happen, yes. That said, WAPBL needs all the love it
can get, and trying to also fix it WRT discards wouldn't really be
helpful at this moment, in my opinion. That's why I plan to just go
with disabling discard when the log is on, for now.

Discards really still need quite some attention. They need the fix WRT
reboots, a performance fix to not be so horribly slow, then a fix to not
hide the freed blocks in the discard queue, so as not to screw up block
allocation and hence cause fragmentation (it matters even for SSDs).
Then we can worry about making it really work with the log :) It would
be awesome if someone would take care of it.

Jaromir

Re: WANTED: nvme(4) driver testing on MP systems on -current

2016-10-20 Thread Jaromír Doleček

I've now committed my fixes for the NVMe driver; it should be more stable
now, give it a try.

With those fixes, the driver works without any problem, even under
fairly heavy I/O load, when nvme.c and ld_nvme.c are compiled with -O0,
on both virtual and real MP machines. An -O2 kernel also works on a virtual
machine, but I've had an I/O lockup on a real hw machine with an -O2
kernel. It may have been unrelated, I'm still investigating.

Jaromir

2016-10-18 22:01 GMT+02:00 Jaromír Doleček <jaromir.dole...@gmail.com>:
> Hey,
>
> thank you. This iostat_unbusy panic is typical symptom of the current
> MP issues, the command completion queue gets corrupted, and
> nvme_q_complete() delivers some commands twice. It causes either this
> panic (due to duplicate lddone() for stale buf), or a random kernel
> crash.
>
> I've been working on debugging this for past two weeks or so. I have
> some local changes (mainly some volatile classifiers) which seem to
> fix this issue at least for my MP VirtualBox test machine. But these
> changes still do not fix the issue completely on another real system I
> have access to. I guess it would be useful to share the ongoing work
> at least. I'll polish and commit what I have, today or tomorrow.
>
> Jaromir
>
> 2016-10-18 10:40 GMT+02:00 Masanobu SAITOH <msai...@execsw.org>:
>> On 2016/09/22 5:54, Jaromír Doleček wrote:
>>>
>>> Hello,
>>>
>>> NVMe driver in NetBSD-current was recently tweaked to fix several MP and
>>> locking
>>> issues, and the driver is now marked as MPSAFE by default.
>>>
>>> Most of this work was done on emulators since I lack the the hardware,
>>> so it's not clear if
>>> everything would work properly on real systems too.
>>>
>>> Anyone having the hardware, I'd appreciate if you could check the
>>> driver out, and try
>>> to punish the drive by some heavy I/O test with parallel load if
>>> possible, and report
>>> results.
>>>
>>> The driver should work on i386 and amd64, and is enabled in
>>> INSTALL/GENERIC kernels there,
>>> so you could just try to boot install iso from NetBSD daily builds,
>>> and send-pr any
>>> issues.
>>>
>>> I'd also especially welcome if someone with sparc64 system could test
>>> the driver out, too.
>>> The driver originates from OpenBSD where nvme(4) is enabled in GENERIC
>>> sparc64
>>> kernel, so it should work. But it was not confirmed yet on
>>> NetBSD/sparc64. Note you might
>>> need fairly modern system, at least some Intel NVMe cards require PCIe
>>> Generation 3 to
>>> actually work, so this rules out e.g. T1s.
>>>
>>> I'd also very welcome any benchmark results, it would be very
>>> interesting to share some
>>> IOPS figures.
>>>
>>> Let me know the results, I'd like to update driver manpage to list
>>> known working hardware.
>>>
>>> In any reports, please include the attachment fragment from dmesg, as
>>> there
>>> is quite significant different between attachment via apic/INTx and
>>> MSI/MSI-X.
>>> Also useful would be intrctl(8) output, to confirm interrupt handlers
>>> are dispatched
>>> properly to individual available CPUs.
>>>
>>> Thank you.
>>>
>>> Jaromir
>>>
>>
>> With nvme.c rev. 1.16:
>>
>>> Oct 18 17:14:02 five savecore: reboot after panic: panic:
>>> ioWsAtRNatI_NWG:Au nRSNPILN GbNuO:Ts  SLPOyLW E RN
>>
>>
>> and,
>>
>>> five# crash -M netbsd.36.core -N /netbsd
>>> Crash version 7.99.39, image version 7.99.39.
>>> System panicked: iostat_unbusy
>>> Backtrace from time of crash is available.
>>> crash> trace
>>> _KERNEL_OPT_NVGA_RASTERCONSOLE() at 0
>>> ?() at 80008f0e5240
>>> vpanic() at vpanic+0x149
>>> snprintf() at snprintf
>>> iostat_isbusy() at iostat_isbusy
>>> dk_done1() at dk_done1+0xab
>>> lddone() at lddone+0xf
>>> nvme_q_complete() at nvme_q_complete+0xc6
>>> softint_dispatch() at softint_dispatch+0xd3
>>> DDB lost frame for Xsoftintr+0x4f, trying 0xfe810e919ff0
>>> Xsoftintr() at Xsoftintr+0x4f
>>> --- interrupt ---
>>> 0:
>>
>>
>> Again, the panic message was:
>>
>>> Oct 18 17:14:02 five savecore: reboot after panic: panic:
>>> ioWsAtRNatI_NWG:Au nRSNPILN GbNuO:Ts  SLPOyLW E RN
>>
>>
>> -> panic: iostat_unbust
>> -> WARNINWG:A RSNPILN GNO:T  SLPOLW E RN
>>
>>   -> WARNING: SPL NOT LOWER
>>   -> WARNING: SPL N
>>
>> The full dmesg is at:
>>
>> http://www.netbsd.org/~msaitoh/nvme-20161018-0.log
>>
>> Any test code are welcomed!
>>
>> --
>> ---
>> SAITOH Masanobu (msai...@execsw.org
>>  msai...@netbsd.org)

Re: Kernel faults, wapbl updates

2016-10-03 Thread Jaromír Doleček

Christos Zoulas wrote:
> Just back the all of yesterdays commit out. Just building with -j 8
> and LOCKDEBUG spins out. It trashes the filesystem and then it gets
> another error about not fixing an inode while replaying the log on
> reboot. I.e. the new kernel not only holds a spinlock and crashes,
> but also does not replay the log properly on boot.

Log replay was not touched. The code however didn't record properly
all deallocations even for some finished and committed transactions,
which caused the replay problems. This should now all be fixed, and
the mutex issue also.

After updating to the newest kernel (with vfs_wapbl.c 1.84), it is
necessary to run fsck to get the filesystem back to a fully healthy state.
After fsck, there shouldn't be any further problems related to the current
change.

Sorry about that and thanks for your patience.

Jaromir

2016-10-02 16:46 GMT+02:00 Jaromír Doleček <jaromir.dole...@gmail.com>:
> There was a use-after-free bug which ended up with the fault on DEBUG
> kernels, it's fixed now in revision 1.82 of kern/vfs_wapbl.c
>
> Thank you.
>
> Jaromir
>
> 2016-10-02 1:26 GMT+02:00 bch <brad.har...@gmail.com>:
>> On 10/1/16, Jaromir Dolecek <jaromir.dole...@gmail.com> wrote:
>>> If you can get just a short traceback (which particular wapbl
>>> function(s) for example), it would help to figure possible problem.
>>
>> Here's a gdb backtrace from a core dump:
>>
>>
>> #0  0x80119a85 in cpu_reboot (howto=howto@entry=260,
>> bootstr=bootstr@entry=0x0) at
>> /usr/src/sys/arch/amd64/amd64/machdep.c:676
>> syncdone = false
>> s = 
>> #1  0x8086e3dc in vpanic (fmt=fmt@entry=0x80ed1503
>> "trap", ap=ap@entry=0xfe804105cb28) at
>> /usr/src/sys/kern/subr_prf.c:342
>> ci = 
>> oci = 
>> bootopt = 260
>> scratchstr = "trap", '\000' 
>> #2  0x8086e490 in panic (fmt=fmt@entry=0x80ed1503
>> "trap") at /usr/src/sys/kern/subr_prf.c:258
>> ap = > generic pointer.)>
>> #3  0x8011b706 in trap (frame=0xfe804105cc60) at
>> /usr/src/sys/arch/amd64/amd64/trap.c:298
>> p = 
>> pcb = 
>> vframe = 
>> ksi = {ksi_flags = 1, ksi_list = {tqe_next = 0x0, tqe_prev =
>> 0x0}, ksi_info = {_signo = 11, _code = 2, _errno = 0, _pad = 0,
>> _reason = {_rt = {_pid = 0, _uid = 0, _value = {sival_int = 6,
>>   sival_ptr = 0x6}}, _child = {_pid = 0, _uid = 0,
>> _status = 6, _utime = 0, _stime = 0}, _fault = {_addr = 0x0, _trap =
>> 6, _trap2 = 0, _trap3 = 0}, _poll = {_band = 0, _fd = 6}}},
>>   ksi_lid = 0}
>> onfault = 
>> type = 6
>> error = 
>> cr2 = 
>> pfail = 
>> #4  0x8010115e in alltraps ()
>> No symbol table info available.
>> #5  0x808cad70 in wapbl_write_revocations
>> (offp=0xfe804105cdc8, wl=0xfe811ce15688) at
>> /usr/src/sys/kern/vfs_wapbl.c:2343
>> wc = 0xfe811c747908
>> blocklen = 
>> off = 6082048
>> wd = 0xfe81deaddead
>> error = 
>> #6  wapbl_flush (wl=0xfe811ce15688, waitfor=waitfor@entry=0) at
>> /usr/src/sys/kern/vfs_wapbl.c:1618
>> bp = 
>> we = 
>> off = 6081536
>> head = 
>> tail = 
>> delta = 0
>> flushsize = 6996480
>> reserved = 
>> error = 
>> __func__ = "wapbl_flush"
>> #7  0x807a45c1 in ffs_sync (mp=0xfe811c95b008, waitfor=3,
>> cred=0xfe811e145f00) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1975
>> vp = 0x0
>> ump = 0xfe8108092b08
>> fs = 0xfe811beb5008
>> marker = 0xfe810a8b7930
>> error = 
>> allerror = 0
>> is_suspending = 
>> ctx = {waitfor = 3, is_suspending = false}
>> __func__ = "ffs_sync"
>> #8  0x808baaa1 in VFS_SYNC (mp=0xfe811c95b008,
>> a=, b=) at
>> /usr/src/sys/kern/vfs_subr.c:1358
>> error = 
>> #9  0x808bad20 in sched_sync (arg=) at
>> /usr/src/sys/kern/vfs_subr.c:785
>> slp = 
>> vp = 
>> mp = 0xfe811c95b008
>> nmp = 0xfe811c95b008
>> starttime = 1475352687
>> synced = true
>> #10 0x801008d7 in lwp_trampoline ()
>>
>>
>>
>>> The changes to vfs_wapbl.c were fairly minor so far. I would
>>>

Re: Kernel faults, wapbl updates

2016-10-02 Thread Jaromír Doleček

There was a use-after-free bug which ended up with the fault on DEBUG
kernels, it's fixed now in revision 1.82 of kern/vfs_wapbl.c

Thank you.

Jaromir

2016-10-02 1:26 GMT+02:00 bch :
> On 10/1/16, Jaromir Dolecek  wrote:
>> If you can get just a short traceback (which particular wapbl
>> function(s) for example), it would help to figure possible problem.
>
> Here's a gdb backtrace from a core dump:
>
>
> #0  0x80119a85 in cpu_reboot (howto=howto@entry=260,
> bootstr=bootstr@entry=0x0) at
> /usr/src/sys/arch/amd64/amd64/machdep.c:676
> syncdone = false
> s = 
> #1  0x8086e3dc in vpanic (fmt=fmt@entry=0x80ed1503
> "trap", ap=ap@entry=0xfe804105cb28) at
> /usr/src/sys/kern/subr_prf.c:342
> ci = 
> oci = 
> bootopt = 260
> scratchstr = "trap", '\000' 
> #2  0x8086e490 in panic (fmt=fmt@entry=0x80ed1503
> "trap") at /usr/src/sys/kern/subr_prf.c:258
> ap =  generic pointer.)>
> #3  0x8011b706 in trap (frame=0xfe804105cc60) at
> /usr/src/sys/arch/amd64/amd64/trap.c:298
> p = 
> pcb = 
> vframe = 
> ksi = {ksi_flags = 1, ksi_list = {tqe_next = 0x0, tqe_prev =
> 0x0}, ksi_info = {_signo = 11, _code = 2, _errno = 0, _pad = 0,
> _reason = {_rt = {_pid = 0, _uid = 0, _value = {sival_int = 6,
>   sival_ptr = 0x6}}, _child = {_pid = 0, _uid = 0,
> _status = 6, _utime = 0, _stime = 0}, _fault = {_addr = 0x0, _trap =
> 6, _trap2 = 0, _trap3 = 0}, _poll = {_band = 0, _fd = 6}}},
>   ksi_lid = 0}
> onfault = 
> type = 6
> error = 
> cr2 = 
> pfail = 
> #4  0x8010115e in alltraps ()
> No symbol table info available.
> #5  0x808cad70 in wapbl_write_revocations
> (offp=0xfe804105cdc8, wl=0xfe811ce15688) at
> /usr/src/sys/kern/vfs_wapbl.c:2343
> wc = 0xfe811c747908
> blocklen = 
> off = 6082048
> wd = 0xfe81deaddead
> error = 
> #6  wapbl_flush (wl=0xfe811ce15688, waitfor=waitfor@entry=0) at
> /usr/src/sys/kern/vfs_wapbl.c:1618
> bp = 
> we = 
> off = 6081536
> head = 
> tail = 
> delta = 0
> flushsize = 6996480
> reserved = 
> error = 
> __func__ = "wapbl_flush"
> #7  0x807a45c1 in ffs_sync (mp=0xfe811c95b008, waitfor=3,
> cred=0xfe811e145f00) at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1975
> vp = 0x0
> ump = 0xfe8108092b08
> fs = 0xfe811beb5008
> marker = 0xfe810a8b7930
> error = 
> allerror = 0
> is_suspending = 
> ctx = {waitfor = 3, is_suspending = false}
> __func__ = "ffs_sync"
> #8  0x808baaa1 in VFS_SYNC (mp=0xfe811c95b008,
> a=, b=) at
> /usr/src/sys/kern/vfs_subr.c:1358
> error = 
> #9  0x808bad20 in sched_sync (arg=) at
> /usr/src/sys/kern/vfs_subr.c:785
> slp = 
> vp = 
> mp = 0xfe811c95b008
> nmp = 0xfe811c95b008
> starttime = 1475352687
> synced = true
> #10 0x801008d7 in lwp_trampoline ()
>
>
>
>> The changes to vfs_wapbl.c were fairly minor so far. I would
>> understand new panics, but it would be strange if they caused faults.
>>
>> Maybe if you can try to downgrade ufs/ffs/ffs_alloc.c before rev.
>> 1.152. It's possible there is some interaction with wapbl which might
>> cause troubles there.
>>
>> Keep me on CC please, I'm working currently on WAPBL and planning some
>> further changes, so I'll fix any regressions asap.
>>
>> Jaromir
>>
>> 2016-10-01 22:50 GMT+02:00 bch :
>>> On Oct 1, 2016 1:44 PM, "bch"  wrote:

>>>> This appears to be trashing files, too, based on what I see trying to CVS
>>>> update
>>>
>>> Incl. author of potential troublesome commit.
>>>
>>>> On Oct 1, 2016 1:32 PM, "bch"  wrote:
>>>>>
>>>>>
>>>>> My system is unstable w latest src. Appears to fault in wapbl
>>>>> functions.
>>>>> Sadly, this appears to correspond w updates in network interfaces, so my
>>>>> .38
>>>>> backup kernel won't cooperate with my .39 userland to bring up the
>>>>> network
>>>>> and update src and poll the machine for more info and send that out.
>>

Re: Kernel faults, wapbl updates

2016-10-01 Thread Jaromír Doleček

If you can get just a short traceback (for example, which particular
wapbl function(s) are involved), it would help to figure out the
possible problem.
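
For anyone with a full gdb backtrace at hand, a minimal sketch of cutting
it down to such a short traceback follows. It assumes the usual gdb frame
layout ("#N  [0xADDR in] function (args...)"), optionally behind mail
quote markers; the names short_trace and FRAME_RE are made up for this
illustration and are not part of any NetBSD tool.

```python
import re

# Matches gdb frame header lines such as
#   "#6  wapbl_flush (wl=..." or "> #7  0x807a45c1 in ffs_sync (mp=..."
# Leading mail quote markers ("> ") are tolerated (assumption: plain '>' quoting).
FRAME_RE = re.compile(
    r"^(?:>\s*)*#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_]\w*)\s*\(",
    re.MULTILINE,
)


def short_trace(backtrace_text, only=None):
    """Return (frame_number, function_name) pairs from a gdb backtrace.

    If `only` is given, keep just the frames whose function name
    contains that substring (e.g. "wapbl").
    """
    frames = [(int(num), name) for num, name in FRAME_RE.findall(backtrace_text)]
    if only is not None:
        frames = [frame for frame in frames if only in frame[1]]
    return frames


if __name__ == "__main__":
    sample = (
        "> #6  wapbl_flush (wl=0xfe811ce15688, waitfor=waitfor@entry=0) at\n"
        "> /usr/src/sys/kern/vfs_wapbl.c:1618\n"
        "> #7  0x807a45c1 in ffs_sync (mp=0xfe811c95b008, waitfor=3,\n"
        "> #10 0x801008d7 in lwp_trampoline ()\n"
    )
    for number, name in short_trace(sample):
        print("#%d %s" % (number, name))
```

Passing only="wapbl" keeps just the wapbl frames, which is the part most
relevant here.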

The changes to vfs_wapbl.c were fairly minor so far. I would
understand new panics, but it would be strange if they caused faults.

Maybe if you can try to downgrade ufs/ffs/ffs_alloc.c before rev.
1.152. It's possible there is some interaction with wapbl which might
cause troubles there.

Keep me on CC please, I'm working currently on WAPBL and planning some
further changes, so I'll fix any regressions asap.

Jaromir

2016-10-01 22:50 GMT+02:00 bch :
> On Oct 1, 2016 1:44 PM, "bch"  wrote:
>>
>> This appears to be trashing files, too, based on what I see trying to CVS
>> update
>
> Incl. author of potential troublesome commit.
>
>> On Oct 1, 2016 1:32 PM, "bch"  wrote:
>>>
>>>
>>> My system is unstable w latest src. Appears to fault in wapbl functions.
>>> Sadly, this appears to correspond w updates in network interfaces, so my .38
>>> backup kernel won't cooperate with my .39 userland to bring up the network
>>> and update src and poll the machine for more info and send that out.

WANTED: nvme(4) driver testing on MP systems on -current

2016-09-21 Thread Jaromír Doleček

Hello,

The NVMe driver in NetBSD-current was recently tweaked to fix several MP
and locking issues, and the driver is now marked as MPSAFE by default.

Most of this work was done on emulators since I lack the hardware, so
it's not clear whether everything works properly on real systems too.

If you have the hardware, I'd appreciate it if you could check the
driver out, try to punish the drive with some heavy I/O test, with
parallel load if possible, and report the results.
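
As a starting point, a parallel load could look roughly like the sketch
below. It is only an illustration under stated assumptions: the function
names (hammer, parallel_load), the worker count and the block sizes are
all made up here, and the scratch file has to be placed on a file system
that actually lives on the NVMe drive under test.

```python
import os
import tempfile
import threading


def hammer(path, worker_id, block_size, blocks, errors):
    """Write a per-worker byte pattern into this worker's region, then verify it."""
    pattern = bytes([worker_id % 256]) * block_size
    base = worker_id * block_size * blocks
    try:
        fd = os.open(path, os.O_RDWR)
        try:
            for i in range(blocks):
                if os.pwrite(fd, pattern, base + i * block_size) != block_size:
                    errors.append((worker_id, i, "short write"))
            os.fsync(fd)  # push the written data towards the device
            for i in range(blocks):
                if os.pread(fd, block_size, base + i * block_size) != pattern:
                    errors.append((worker_id, i, "mismatch"))
        finally:
            os.close(fd)
    except OSError as exc:
        errors.append((worker_id, -1, repr(exc)))


def parallel_load(path, workers=8, block_size=64 * 1024, blocks=32):
    """Run `workers` threads against one pre-sized scratch file; return the error list."""
    with open(path, "wb") as scratch:
        scratch.truncate(workers * block_size * blocks)
    errors = []
    threads = [
        threading.Thread(target=hammer, args=(path, w, block_size, blocks, errors))
        for w in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors


if __name__ == "__main__":
    # Hypothetical location: replace with a directory on the drive being tested.
    with tempfile.TemporaryDirectory() as scratch_dir:
        problems = parallel_load(os.path.join(scratch_dir, "scratch.bin"))
        print("errors:", problems)
```

Note that the read-back pass will mostly be served from the buffer cache,
so this exercises the write path and the locking far more than it
verifies what ended up on the media; a much larger file or a dedicated
benchmark tool is needed for real IOPS figures.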

The driver should work on i386 and amd64, and is enabled in the
INSTALL/GENERIC kernels there, so you could just try to boot the install
ISO from the NetBSD daily builds and send-pr any issues.

I'd also especially welcome it if someone with a sparc64 system could
test the driver. The driver originates from OpenBSD, where nvme(4) is
enabled in the GENERIC sparc64 kernel, so it should work, but this has
not been confirmed yet on NetBSD/sparc64. Note that you might need a
fairly modern system: at least some Intel NVMe cards require PCIe
Generation 3 to actually work, which rules out e.g. T1s.

I'd also very much welcome any benchmark results; it would be very
interesting to share some IOPS figures.

Let me know the results; I'd like to update the driver manpage to list
known working hardware.

In any reports, please include the attachment fragment from dmesg, as
there is quite a significant difference between attachment via
apic/INTx and MSI/MSI-X. The intrctl(8) output would also be useful, to
confirm that interrupt handlers are dispatched properly to the
individual available CPUs.
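
To make collecting that fragment less tedious, here is a small sketch
that pulls the nvme-related lines out of saved dmesg text and guesses
the interrupt type. Both helper names are hypothetical, and the
substrings used for the guess ("msix", "msi", "apic", "irq") are only
assumptions about how the interrupt line is worded; check them against
what your kernel actually prints.

```python
import re


def attachment_lines(dmesg_text, driver="nvme"):
    """Return the dmesg lines that mention an instance of the driver (e.g. nvme0)."""
    instance = re.compile(r"\b%s\d+\b" % re.escape(driver))
    return [line for line in dmesg_text.splitlines() if instance.search(line)]


def interrupt_kind(line):
    """Guess the interrupt type from one dmesg line; None if nothing matches.

    The matched substrings are assumptions, not a documented format.
    """
    lowered = line.lower()
    if "msix" in lowered or "msi-x" in lowered:
        return "MSI-X"
    if "msi" in lowered:
        return "MSI"
    if "apic" in lowered or "irq" in lowered:
        return "INTx"
    return None


if __name__ == "__main__":
    # Made-up sample text for illustration only, not output from a real machine.
    sample = (
        "wd0 at atabus0 drive 0\n"
        "nvme0 at pci1 dev 0 function 0\n"
        "nvme0: interrupting at msix0 vec 0\n"
        "ld0 at nvme0 nsid 1\n"
    )
    for line in attachment_lines(sample):
        print(interrupt_kind(line), "|", line)
```

On a live system the input would simply be the saved output of dmesg(8);
the intrctl(8) output is best attached unmodified.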

Thank you.

Jaromir

Re: Building on OS X - how?

2016-08-13 Thread Jaromír Doleček

FWIW, the build of tools for both i386 and sparc64 finished without
problems for me on a Mac OS X host (10.11.6), building from clean
sources.

Jaromir

2016-08-12 21:54 GMT+02:00 matthew green :
> Thor Lancelot Simon writes:
>> On Thu, Aug 11, 2016 at 04:05:06PM +0100, Robert Swindells wrote:
>> >
>> > >2) /usr/bin/cc:
>> > >Undefined symbols for architecture x86_64: "_iconv"
>> > >in external/gpl3/gcc/usr.bin/backend
>> >
>> > This should be in libc.
>>
>> For what value of "should"?  _iconv is in the implementation-defined
>> namespace.
>>
>> It's curious that this doesn't break the tools build, and doesn't
>> prevent using the built tools to build a kernel!  If this can break
>> the cross-build of the target compiler, I think we must have suddenly
>> sprouted a rather serious instance of host/target confusion.
>
> this fails building the native gcc, which requires a bunch of host
> tools to run.  going on your following post, there is a problem
> with genmatch.  i don't have access to any osx to test, so i'm not
> sure where to start looking.  there aren't too many rules used in
> the creation of "genmatch" binary - can you run "make cleandir"
> in usr.bin/backend/ and then "make MAKEVERBOSE=2 genmatch", and
> post all the commands run?  there probably will be a configure
> run in here, and perhaps the output of it also matters.
>
>
> .mrg.
