Re: build.sh kernel does not finish with endless nbctfmerge run
Oops, forgot to "cvs update" -- all builds are working for me.

Sorry for the noise ...

--
J. Hannken-Illjes - hann...@mailbox.org

> On 31. Mar 2024, at 12:14, J. Hannken-Illjes wrote:
>
>> On 31. Mar 2024, at 11:23, Ryo ONODERA wrote:
>>
>> chris...@astron.com (Christos Zoulas) writes:
>>
>>> In article <4b5a66e1-7a3e-48ce-9ace-f9249e75f...@mailbox.org>,
>>> J. Hannken-Illjes wrote:
>>>> I also added an abort() when _dwarf_get_reloc_size() returns on
>>>> "/* unknown relocation. */" and this killed nbctfconvert() as
>>>>
>>>> _dwarf_get_reloc_size ()
>>>> _dwarf_elf_init ()
>>>> dwarf_elf_init ()
>>>> dw_read ()
>>>> main ()
>>>>
>>>> For me nbctfmerge on kernels succeeds after up to 30 minutes, but the
>>>> resulting CTF sections are far too big and look very strange:
>>>
>>> Yes, I also reproduced it. Back to the drawing board...
>>>
>>> christos
>>>
>>
>> Anyway I can finish build.sh kernel=GENERIC now.
>> Thanks for your investigations.
>
> Unfortunately broken again (for read only source at least) ...
>
> After Taylor's commits last night:
>
>   cvs rdiff -u -r1.217 -r1.218 src/tools/Makefile
>   cvs rdiff -u -r1.2 -r0 src/tools/elftoolchain/Makefile
>   cvs rdiff -u -r1.5 -r1.6 src/tools/elftoolchain/libdwarf/Makefile
>
> my clean build of amd64, i386 and sparc64 succeeded without any
> problem.
>
> With your commit this morning:
>
>   cvs rdiff -u -r1.6 -r1.7 \
>     src/external/bsd/elftoolchain/dist/libdwarf/libdwarf_reloc.c
>   cvs rdiff -u -r1.8 -r1.9 src/tools/Makefile.nbincludes
>   cvs rdiff -u -r1.3 -r1.4 src/tools/elftoolchain/common/sys/Makefile
>   cvs rdiff -u -r0 -r1.1 src/tools/elftoolchain/common/sys/elfdefinitions.h
>
> amd64 and i386 still build, but sparc64 fails with:
>
>   sh: cannot create elfdefinitions.h: read-only file system
>   --- elfdefinitions.h ---
>   *** Failed target: elfdefinitions.h
>   *** Failed commands:
>       ${_MKTARGET_CREATE}
>       => @# " create " sys/elfdefinitions.h
>       ${TOOL_M4} -I${SRCDIR} -D SRCDIR=${SRCDIR} ${M4FLAGS} elfdefinitions.m4 > ${.TARGET}
>       => /work/build/obj/tools.sparc64/bin/nbm4
>          -I/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys
>          -D SRCDIR=/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys
>          elfdefinitions.m4 > elfdefinitions.h
>   *** [elfdefinitions.h] Error code 1
>
> --
> J. Hannken-Illjes - hann...@mailbox.org
Re: build.sh kernel does not finish with endless nbctfmerge run
> On 31. Mar 2024, at 11:23, Ryo ONODERA wrote:
>
> chris...@astron.com (Christos Zoulas) writes:
>
>> In article <4b5a66e1-7a3e-48ce-9ace-f9249e75f...@mailbox.org>,
>> J. Hannken-Illjes wrote:
>>> I also added an abort() when _dwarf_get_reloc_size() returns on
>>> "/* unknown relocation. */" and this killed nbctfconvert() as
>>>
>>> _dwarf_get_reloc_size ()
>>> _dwarf_elf_init ()
>>> dwarf_elf_init ()
>>> dw_read ()
>>> main ()
>>>
>>> For me nbctfmerge on kernels succeeds after up to 30 minutes, but the
>>> resulting CTF sections are far too big and look very strange:
>>
>> Yes, I also reproduced it. Back to the drawing board...
>>
>> christos
>>
>
> Anyway I can finish build.sh kernel=GENERIC now.
> Thanks for your investigations.

Unfortunately broken again (for read only source at least) ...

After Taylor's commits last night:

  cvs rdiff -u -r1.217 -r1.218 src/tools/Makefile
  cvs rdiff -u -r1.2 -r0 src/tools/elftoolchain/Makefile
  cvs rdiff -u -r1.5 -r1.6 src/tools/elftoolchain/libdwarf/Makefile

my clean build of amd64, i386 and sparc64 succeeded without any
problem.

With your commit this morning:

  cvs rdiff -u -r1.6 -r1.7 \
    src/external/bsd/elftoolchain/dist/libdwarf/libdwarf_reloc.c
  cvs rdiff -u -r1.8 -r1.9 src/tools/Makefile.nbincludes
  cvs rdiff -u -r1.3 -r1.4 src/tools/elftoolchain/common/sys/Makefile
  cvs rdiff -u -r0 -r1.1 src/tools/elftoolchain/common/sys/elfdefinitions.h

amd64 and i386 still build, but sparc64 fails with:

  sh: cannot create elfdefinitions.h: read-only file system
  --- elfdefinitions.h ---
  *** Failed target: elfdefinitions.h
  *** Failed commands:
      ${_MKTARGET_CREATE}
      => @# " create " sys/elfdefinitions.h
      ${TOOL_M4} -I${SRCDIR} -D SRCDIR=${SRCDIR} ${M4FLAGS} elfdefinitions.m4 > ${.TARGET}
      => /work/build/obj/tools.sparc64/bin/nbm4
         -I/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys
         -D SRCDIR=/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys
         elfdefinitions.m4 > elfdefinitions.h
  *** [elfdefinitions.h] Error code 1

--
J. Hannken-Illjes - hann...@mailbox.org
Re: build.sh kernel does not finish with endless nbctfmerge run
I also added an abort() when _dwarf_get_reloc_size() returns on
"/* unknown relocation. */" and this killed nbctfconvert() as

  _dwarf_get_reloc_size ()
  _dwarf_elf_init ()
  dwarf_elf_init ()
  dw_read ()
  main ()

For me nbctfmerge on kernels succeeds after up to 30 minutes, but the
resulting CTF sections are far too big and look very strange:

Before this change I had

  total number of types      = 30015
  total number of integers   = 65
  total number of floats     = 1
  total number of pointers   = 7902
  total number of arrays     = 3515
  total number of func types = 2252

and now I have

  total number of types      = 322862
  total number of integers   = 3865
  total number of floats     = 1
  total number of pointers   = 127495
  total number of arrays     = 14350
  total number of func types = 65411

and running the merge with CTFMERGE_DEBUG_LEVEL=2 I get

  ERROR: nbctfmerge: Second pass for 5978 ((anon)) == 13434

--
J. Hannken-Illjes - hann...@mailbox.org

> On 30. Mar 2024, at 15:23, Christos Zoulas wrote:
>
> I don't think that's the problem. I added abort() calls just before the
> return 0 and they never fire for me (and the kernel built has the right
> CTF information). Nevertheless I think that the relocation code is not
> used in the CTF code; it just parses the debug dwarf info and builds
> CTF stabs from them. I think that the threading code in ctf is
> problematic because we had this problem in the past.
>
> christos
>
>> On Mar 29, 2024, at 9:43 PM, Ryo ONODERA wrote:
>>
>> Hi,
>>
>> The following two commits cause endless nbctfmerge run
>> at end of build.sh kernel=GENERIC for me.
>> My environment is NetBSD/amd64 10.99.10 of yesterday.
>>
>> Could you investigate my problem?
>>
>> Module Name: src
>> Committed By: christos
>> Date: Wed Mar 27 21:53:06 UTC 2024
>>
>> Modified Files:
>>   src/external/bsd/elftoolchain/dist/libdwarf: libdwarf_reloc.c
>>
>> Log Message:
>> Don't try to compile the arch-specific relocation code if we don't have the
>> built-in headers (for tools)
>>
>> To generate a diff of this commit:
>> cvs rdiff -u -r1.5 -r1.6 \
>>   src/external/bsd/elftoolchain/dist/libdwarf/libdwarf_reloc.c
>>
>> Module Name: src
>> Committed By: christos
>> Date: Wed Mar 27 21:54:43 UTC 2024
>>
>> Modified Files:
>>   src/tools: Makefile.nbincludes
>>   src/tools/elftoolchain/libdwarf: Makefile
>>
>> Log Message:
>> Remove dependency to elfdefinitions.h, this is a mess, since it needs
>> ${TOOL_M4} which might not be available yet.
>>
>> To generate a diff of this commit:
>> cvs rdiff -u -r1.7 -r1.8 src/tools/Makefile.nbincludes
>> cvs rdiff -u -r1.4 -r1.5 src/tools/elftoolchain/libdwarf/Makefile
>>
>>
>> Thank you.
>>
>> --
>> Ryo ONODERA // r...@tetera.org
>> PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB FD1B F404 27FA C7D1 15F3
>
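To put the two sets of CTF counters from this message side by side, here is a minimal sketch that computes the growth factor for each counter. The numbers are copied from the message; the short labels (types, functypes, ...) and the function name ctf_growth are made up for this illustration and are not the output of any CTF tool.

```shell
#!/bin/sh
# Growth factor of each CTF counter: columns are label, before, after.
# Labels are shortened, hypothetical names for the "total number of ..."
# lines quoted in the message.
ctf_growth() {
    awk '{ printf "%-10s %6d -> %6d  x%.1f\n", $1, $2, $3, $3 / $2 }' <<'EOF'
types 30015 322862
integers 65 3865
floats 1 1
pointers 7902 127495
arrays 3515 14350
functypes 2252 65411
EOF
}

ctf_growth
```

The total type count grows by roughly a factor of ten while the float count stays the same, which fits the observation that the merged CTF data looks strange rather than just larger.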
Re: Strange sensor names for amdzentemp(4)
> On Mar 20, 2024, at 4:27 PM, Paul Goyette wrote:
>
> Oddly, I am seeing the following sensor info. Note that the config
> doesn't even contain ``ccd''. (Previous incarnations of this config
> _did_ have ccd, but it's been completely removed when I changed to
> use raidframe...)
>
> # envstat -d amdzentemp0
>                        Current  CritMax  WarnMax  WarnMin  CritMin  Unit
>       cpu0 temperature:  55.125                                     degC
>  cpu0 ccd0 temperature:  36.375                                     degC
>  cpu0 ccd1 temperature:  37.500                                     degC
> #

The string originates from sys/arch/x86/pci/amdzentemp.c line 471.

In this context CCD stands for Core Complex Die.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: ethernet
> On 23. Dec 2023, at 16:17, xuser wrote:
>
> Does any one know how to have two ip addresses on one interface?

ifconfig IF inet ADDR alias

--
J. Hannken-Illjes - hann...@mailbox.org
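As a concrete illustration of the one-line answer, a minimal sketch follows. The interface name wm0 and the 192.0.2.x addresses are placeholders, and the helper alias_cmds is hypothetical; it only prints the commands so they can be reviewed before being run as root. Like the reply, it leaves out any netmask.

```shell
#!/bin/sh
# Placeholder values, not taken from the thread.
IF=wm0
PRIMARY=192.0.2.10
SECOND=192.0.2.11

# Print the two ifconfig invocations: the first sets the primary
# address, the second adds another address using the "alias" keyword.
alias_cmds() {
    echo "ifconfig $1 inet $2"
    echo "ifconfig $1 inet $3 alias"
}

alias_cmds "$IF" "$PRIMARY" "$SECOND"
```

Replacing "alias" with "-alias" in the second command removes that address again.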
Re: Panic on dump over fss
> On 22. Mar 2023, at 15:03, César Catrián C. wrote:
>
>> /sbin/dump -0af - -x /var/tmp /home | ...
>>
>> Dump will take and release the snapshot for you, see "man dump".
>>
>
> Got the same panic with the command:
>
> # /sbin/dump -0af - -x /var/tmp / | /usr/bin/bzip2 -1 > /mnt/fs4/backups/current/root.dump.bz2
> DUMP: Found /dev/rdk0 on / in /etc/fstab
>
> [ 6216.0201020] panic: kernel diagnostic assertion "(req->req_bp->b_flags & B_PHYS) != 0" failed: file "/home/src/netbsd-current/src/sys/arch/xen/xen/xbd_xenbus.c", line 1374
> [ 6216.0201020] cpu0: Begin traceback...
> [ 6216.0201020] vpanic() at netbsd:vpanic+0x183
> [ 6216.0201020] kern_assert() at netbsd:kern_assert+0x4b
> [ 6216.0201020] xbd_diskstart() at netbsd:xbd_diskstart+0x7d4
> [ 6216.0201020] dk_start() at netbsd:dk_start+0xe0
> [ 6216.0201020] bdev_strategy() at netbsd:bdev_strategy+0x81
> [ 6216.0201020] spec_strategy() at netbsd:spec_strategy+0x6e
> [ 6216.0201020] VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x3c
> [ 6216.0201020] dkstart() at netbsd:dkstart+0x184
> [ 6216.0201020] bdev_strategy() at netbsd:bdev_strategy+0x81
> [ 6216.0201020] fss_bs_thread() at netbsd:fss_bs_thread+0x32c
> [ 6216.0201020] cpu0: End traceback...
>
> [ 6216.0201020] dumping to dev 168,1 (offset=8, size=1048576):
> [ 6216.0201020] dump device bad

Could you try the attached patch?

--
J. Hannken-Illjes - hann...@mailbox.org

Attachment: fss.c.diff
Re: Panic on dump over fss
> On 22. Mar 2023, at 13:43, César Catrián C. wrote:
>
> Hi, please help with an issue while running dump(8) over a snapshotted
> filesystem using fss.
>
> The system is current 10.99.2 pvh Xen VM, compiled with mar 18 2023 sources,
> running under NetBSD Xen, current 10.99.2 also, with feb 23 2023 sources.
>
> Did other successful backups on two stable NetBSD 9.3 VMs.
>
> The dump file is being put over an NFS share in another NetBSD machine.
>
> These are the commands used:
> /usr/sbin/fssconfig fss0 /home /var/tmp
> /sbin/mount /dev/fss0 /mnt/fs3
> /sbin/dump -0af - /mnt/fs3 | /usr/bin/bzip2 > /mnt/fs4/backups/current/home.dump.bz2

There is no need to mount the snapshot, the simplest way to dump is:

  /sbin/dump -0af - -x /var/tmp /home | ...

Dump will take and release the snapshot for you, see "man dump".

--
J. Hannken-Illjes - hann...@mailbox.org
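For reference, here is a sketch of how the suggested -x form might replace the three original commands, reusing the paths quoted in this thread. The helper snapshot_dump_cmd is hypothetical and only prints the pipeline for review; it does not address the panic discussed in this thread.

```shell
#!/bin/sh
# Paths taken from the commands quoted in the thread; adjust as needed.
FS=/home                  # file system to dump
SNAP_BACKUP=/var/tmp      # argument given to dump's -x option
OUT=/mnt/fs4/backups/current/home.dump.bz2

# Print the single pipeline that replaces fssconfig + mount + dump.
# Run the printed command as root once it looks right.
snapshot_dump_cmd() {
    printf '/sbin/dump -0af - -x %s %s | /usr/bin/bzip2 > %s\n' "$1" "$2" "$3"
}

snapshot_dump_cmd "$SNAP_BACKUP" "$FS" "$OUT"
```

Compared with the original recipe there is no fss0 device to configure, mount or clean up afterwards, since dump handles the snapshot itself.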
Re: blocklist puzzle
> On 18. Feb 2023, at 23:34, Patrick Welche wrote:
>
> 12 hours after rebooting
>
> # npfctl rule blocklistd list
> block in final family inet4 proto tcp from 61.177.173.35/32 to any port 22 # id="1"
> #
>
> contains a single block, yet /var/log/messages is full:
>
> Feb 18 17:47:44 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 18:18:00 mail blocklistd[596]: released 171.225.184.179/32:22 after 172800 seconds
> Feb 18 18:18:07 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 18:35:18 mail blocklistd[596]: blocked 31.41.244.124/32:22 for 172800 seconds
> Feb 18 18:48:10 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> Feb 18 19:18:02 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 20:18:13 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 20:47:46 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> Feb 18 21:17:48 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> Feb 18 21:47:55 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
>
>
>
> If something were misconfigured, I would expect no hosts in the ruleset,
> rather than some (or one). How can this work partially?
>
> extract of npf.conf:
>
> group "external" on $ext_if {
>     pass stateful out final all
>
>     ruleset "blocklistd"
>
> ...

Looks like your ruleset "blocklistd" never fires as the rule above
is "final all".

--
J. Hannken-Illjes - hann...@mailbox.org
Re: 9.99.104: panic in tcp_shutdown_wrapper
> On 30. Oct 2022, at 06:52, Michael van Elst wrote:
>
> ozak...@netbsd.org (Ryota Ozaki) writes:
>
>> I've committed a possible fix. Could you try it?
>
>> Thanks,
>> ozaki-r
>
>
> I just got a NULL pointer dereference in tcp_ctloutput where
> the previous check for inp == NULL is also missing.
>
> [ 24837.756043] fp c0016794db70 tcp_ctloutput() at c02ec4b4 netbsd:tcp_ctloutput+0x94
> [ 24837.756043] fp c0016794dcc0 tcp_ctloutput_wrapper() at c02d2680 netbsd:tcp_ctloutput_wrapper+-0x31150
> [ 24837.756043] fp c0016794dcf0 sosetopt() at c0603cbc netbsd:sosetopt+0x78
> [ 24837.756043] fp c0016794ddb0 sys_setsockopt() at c060b0fc netbsd:sys_setsockopt+0x7c
> [ 24837.766041] fp c0016794de20 syscall() at c00b30fc netbsd:syscall+0x19c
>
> That's:
>
> int
> tcp_ctloutput(int op, struct socket *so, struct sockopt *sopt)
> {
> ...
>         s = splsoftnet();
>         inp = sotoinpcb(so);
> ...
>         }
>         tp = intotcpcb(inp); <-
>
>         switch (op) {

... and syzkaller (https://syzkaller.appspot.com/netbsd) has a bunch
of new tcp related crashes starting ~2 days before ...

--
J. Hannken-Illjes - hann...@mailbox.org
Re: Doc error - sysctl
> On 25. Jul 2022, at 16:30, Paul Goyette wrote:
>
> It seems that kern.maxvnodes is documented as "cannot be lowered"
>
>     kern.maxvnodes (KERN_MAXVNODES)
>             The maximum number of vnodes available on the system. This can
>             only be raised.
>
> However, the kernel allows you to lower the value, and it helps if
> you want to flush file cache (free up active memory).

Yes, it can be lowered and will fail if you try to go below the
number of active vnodes.

Please go ahead and fix the documentation.

>
>
> +--------------------+--------------------------+----------------------+
> | Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:    |
> | (Retired)          | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com    |
> | Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org  |
> | & Network Engineer |                          | pgoyett...@gmail.com |
> +--------------------+--------------------------+----------------------+

--
J. Hannken-Illjes - hann...@mailbox.org
Re: pgdaemon high CPU consumption
> On 1. Jul 2022, at 07:55, Matthias Petermann wrote:
>
> Good day,
>
> since some time I noticed that on several of my systems with NetBSD/amd64
> 9.99.97/98 after longer usage the kernel process pgdaemon completely claims a
> CPU core for itself, i.e. constantly consumes 100%.
> The affected systems do not have a shortage of RAM and the problem does not
> disappear even if all workloads are stopped, and thus no RAM is actually used
> by application processes.
>
> I noticed this especially in connection with accesses to the ZFS set up on
> the respective machines - for example after checkout from the local CVS relic
> hosted on ZFS.
>
> Is there already a known problem or what information would have to be
> collected to get to the bottom of this?
>
> I currently have such a case online, so I would be happy to pull diagnostic
> information this evening/afternoon. At the moment all info I have is from top.
>
> Normal view:
>
> ```
>   PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
>     0 root     126    0     0K   34M CPU/0    102:45   100%   100% [system]
> ```
>
> Thread view:
>
>
> ```
>   PID   LID USERNAME PRI STATE      TIME   WCPU    CPU NAME      COMMAND
>     0   173 root     126 CPU/1     96:57 98.93% 98.93% pgdaemon  [system]
> ```

Looks a lot like kern/55707: ZFS seems to trigger a lot of xcalls

Last action proposed was to back out the patch ...

--
J. Hannken-Illjes - hann...@mailbox.org
Re: kernel deadlock on fstchg with vnd
> On 29. May 2022, at 23:57, Manuel Bouyer wrote:
>
> On Sun, May 29, 2022 at 01:18:16PM +0200, J. Hannken-Illjes wrote:
>>> On 29. May 2022, at 08:30, Michael van Elst wrote:
>>>
>>> bou...@antioche.eu.org (Manuel Bouyer) writes:
>>>
>>>> Hello,
>>>> do you have an idea on the problem in this thread:
>>>> http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html
>>> [...]
>>>> I can't reproduce this when using vnd from userland.
>>>
>>> You can replicate it by addressing the block device with vnconfig.
>>>
>>> A workaround would be to modify the Xen block script to select the
>>> raw device:
>>>
>>> vnconfig /dev/r${disk}d $xparams >/dev/null; then
>>>
>>> or just the disk name:
>>>
>>> vnconfig ${disk} $xparams >/dev/null; then
>>
>> Good catch, sys/dev/vnd.c has this:
>>
>> 1751 static void
>> 1752 vndclear(struct vnd_softc *vnd, int myminor)
>> 1753 {
>> 1754         struct vnode *vp = vnd->sc_vp;
>> 1755         int fflags = FREAD;
>> 1756         int bmaj, cmaj, i, mn;
>> 1757         int s;
>> 1758
>> 1759 #ifdef DEBUG
>> 1760         if (vnddebug & VDB_FOLLOW)
>> 1761                 printf("vndclear(%p): vp %p\n", vnd, vp);
>> 1762 #endif
>> 1763         /* locate the major number */
>> 1764         bmaj = bdevsw_lookup_major(&vnd_bdevsw);
>> 1765         cmaj = cdevsw_lookup_major(&vnd_cdevsw);
>> 1766
>> 1767         /* Nuke the vnodes for any open instances */
>> 1768         for (i = 0; i < MAXPARTITIONS; i++) {
>> 1769                 mn = DISKMINOR(device_unit(vnd->sc_dev), i);
>> 1770                 vdevgone(bmaj, mn, mn, VBLK);
>> 1771                 if (mn != myminor) /* XXX avoid to kill own vnode */
>> 1772                         vdevgone(cmaj, mn, mn, VCHR);
>> 1773         }
>>
>> The "skip myself" on lines 1771/1772 is responsible for this behaviour.
>
> Yes and doing the same for block devices avoids the issue.
> But Taylor is reluctant to commit this hack.

And he is right. It smells fishy to detach a (pseudo) device from
an open instance of itself, either with ioctl or close.

Why do we detach on last close -- isn't it sufficient to detach
either explicitly with drvctl(8) or on module unload?

The attached diff moves vdevgone() to vnd_detach() and no longer
detaches on last close -- comments?

--
J. Hannken-Illjes - hann...@mailbox.org

Attachment: vnd.c.diff
Re: kernel deadlock on fstchg with vnd
> On 29. May 2022, at 08:30, Michael van Elst wrote:
>
> bou...@antioche.eu.org (Manuel Bouyer) writes:
>
>> Hello,
>> do you have an idea on the problem in this thread:
>> http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html
> [...]
>> I can't reproduce this when using vnd from userland.
>
> You can replicate it by addressing the block device with vnconfig.
>
> A workaround would be to modify the Xen block script to select the
> raw device:
>
> vnconfig /dev/r${disk}d $xparams >/dev/null; then
>
> or just the disk name:
>
> vnconfig ${disk} $xparams >/dev/null; then

Good catch, sys/dev/vnd.c has this:

1751 static void
1752 vndclear(struct vnd_softc *vnd, int myminor)
1753 {
1754         struct vnode *vp = vnd->sc_vp;
1755         int fflags = FREAD;
1756         int bmaj, cmaj, i, mn;
1757         int s;
1758
1759 #ifdef DEBUG
1760         if (vnddebug & VDB_FOLLOW)
1761                 printf("vndclear(%p): vp %p\n", vnd, vp);
1762 #endif
1763         /* locate the major number */
1764         bmaj = bdevsw_lookup_major(&vnd_bdevsw);
1765         cmaj = cdevsw_lookup_major(&vnd_cdevsw);
1766
1767         /* Nuke the vnodes for any open instances */
1768         for (i = 0; i < MAXPARTITIONS; i++) {
1769                 mn = DISKMINOR(device_unit(vnd->sc_dev), i);
1770                 vdevgone(bmaj, mn, mn, VBLK);
1771                 if (mn != myminor) /* XXX avoid to kill own vnode */
1772                         vdevgone(cmaj, mn, mn, VCHR);
1773         }

The "skip myself" on lines 1771/1772 is responsible for this behaviour.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
> On 27. May 2022, at 16:24, Manuel Bouyer wrote:
>
> On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
>>> On 27. May 2022, at 14:41, Matthias Petermann wrote:
>>>
>>> Hello Jürgen,
>>>
>>> Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
>>>> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
>>>> should give even more details.
>>>
>>> here is the stacktrace from the vnconfig process (the PID has changed since
>>> I restarted):
>>>
>>> https://www.petermann-it.de/tmp/p7.jpg
>>
>> This is the thread currently suspending the root fs (vrevoke suspends it).
>>
>> Looks like it is waiting for I/O to drain on the vnd device ...
>>
>>> You can find the output of fstrans_dump here:
>>>
>>> https://www.petermann-it.de/tmp/p8.jpg
>>
>> The owner is irritating, it should be vnconfig from above.
>
> I can reproduce it:

What is the recipe?

> db{0}> ps
> PID    LID  S CPU FLAGS STRUCT LWP *  NAME      WAIT
> 2419   2419 3   8     0 9000210b9280  tcsh      fstchg
> 2415   2415 3  11     0 90001f66f540  vnconfig  fstchg
> 2416   2416 3  18     0 900020ea3200  dirname   fstchg
> 2417   2417 3  24     0 900020e6c700  sh        fstchg
> 2414   2414 3  12     0 90001f6d7a00  vnconfig  specio
> [...]
> db{0}> tr/t 0t2415
> trace: pid 2415 lid 2415 at 0x90008ed3e980
> sleepq_block() at netbsd:sleepq_block+0x12c
> cv_wait() at netbsd:cv_wait+0x42
> fstrans_start() at netbsd:fstrans_start+0x193
> VOP_LOCK() at netbsd:VOP_LOCK+0x79
> vn_lock() at netbsd:vn_lock+0xae
> namei_tryemulroot() at netbsd:namei_tryemulroot+0x1024
> namei() at netbsd:namei+0x29
> vn_open() at netbsd:vn_open+0x133
> do_open() at netbsd:do_open+0xc3
> do_sys_openat() at netbsd:do_sys_openat+0x74
> sys_open() at netbsd:sys_open+0x24
> syscall() at netbsd:syscall+0x18c
> --- syscall (number 5) ---
> netbsd:syscall+0x18c:
> db{0}> tr/t 0t2414
> trace: pid 2414 lid 2414 at 0x90008c57e6c0
> sleepq_block() at netbsd:sleepq_block+0x12c
> cv_wait() at netbsd:cv_wait+0x42
> spec_io_drain() at netbsd:spec_io_drain+0x84
> spec_close() at netbsd:spec_close+0x1c6
> VOP_CLOSE() at netbsd:VOP_CLOSE+0x38
> spec_node_revoke() at netbsd:spec_node_revoke+0x14d
> vcache_reclaim() at netbsd:vcache_reclaim+0x4e7
> vgone() at netbsd:vgone+0xcd
> vrevoke() at netbsd:vrevoke+0xfa
> genfs_revoke() at netbsd:genfs_revoke+0x13
> VOP_REVOKE() at netbsd:VOP_REVOKE+0x35
> vdevgone() at netbsd:vdevgone+0x64
> vnddoclear.part.0() at netbsd:vnddoclear.part.0+0xaa
> vndioctl() at netbsd:vndioctl+0x78c
> bdev_ioctl() at netbsd:bdev_ioctl+0x91
> spec_ioctl() at netbsd:spec_ioctl+0xa5
> VOP_IOCTL() at netbsd:VOP_IOCTL+0x41
> vn_ioctl() at netbsd:vn_ioctl+0xb3
> sys_ioctl() at netbsd:sys_ioctl+0x555
> syscall() at netbsd:syscall+0x18c
> --- syscall (number 54) ---
> netbsd:syscall+0x18c:
> db{0}> call fstrans_dump
> Fstrans locks by lwp:
> [ 5691.3454404] 2414.241 (/) shared 2 cow 0 alias 0
> [ 5691.3454404] Fstrans state by mount:
> [ 5691.3454404] / owner 0x90001f6d7a00 state suspended
>
> In the ps output there is also:
> 0      2324 3   3   200 90001fe43340  vnd0      vndbp
> db{0}> tr/a 90001fe43340
> trace: pid 0 lid 2324 at 0x90008c806df0
> sleepq_block() at netbsd:sleepq_block+0x12c
> vndthread() at netbsd:vndthread+0x78c
>
> So it looks like vnconfig waits for the vnd I/O to drain, but the vnd thread
> is idle.

No -- the name is confusing, it waits for spec_io_enter/exit to drain.

Better ask Taylor ...

--
J. Hannken-Illjes - hann...@mailbox.org
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
> On 27. May 2022, at 14:41, Matthias Petermann wrote:
>
> Hello Jürgen,
>
> Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
>> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
>> should give even more details.
>
> here is the stacktrace from the vnconfig process (the PID has changed since I
> restarted):
>
> https://www.petermann-it.de/tmp/p7.jpg

This is the thread currently suspending the root fs (vrevoke suspends it).

Looks like it is waiting for I/O to drain on the vnd device ...

> You can find the output of fstrans_dump here:
>
> https://www.petermann-it.de/tmp/p8.jpg

The owner is confusing, it should be vnconfig from above.

> I hope this helps a bit in troubleshooting.
>
> Kind regards
> Matthias

--
J. Hannken-Illjes - hann...@mailbox.org
Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)
> On 27. May 2022, at 14:06, Matthias Petermann wrote:
>
> Hi Manuel,
>
> Am 27.05.2022 um 12:14 schrieb Manuel Bouyer:
>>> Paginated processes list:
>>>
>>> https://www.petermann-it.de/tmp/p1.jpg
>>> https://www.petermann-it.de/tmp/p2.jpg
>>> https://www.petermann-it.de/tmp/p3.jpg
>> several processes in fstchg wait, a stack trace of these processes
>> (tr/t 0t or tr/a 0x would show these) would help.
>>
>> So it looks like a deadlock in the filesystem. What is your storage
>> configuration ?
>>
>
> Thanks for your advice - I did another series of screenshots and prepared the
> relevant information here:
>
>    https://www.petermann-it.de/tmp/p6.png
>
> My storage configuration this time is nothing out of the ordinary:
>
> ```
> wd0 (GPT)
>  |
>  '-- dk0 (NAME:root, FFSv2 with log, contains the root filesystem)
>  '-- dk1 (NAME:swap)
>  '-- dk2 (NAME:data, FFSv2 with log, contains VND-Images)
>       |
>       '-- net.img (16 GB sparse file image)
>       '-- net-export.img (500 GB sparse file image)
> ```
>
> Since you bring up the deadlock / filesystem assumption - I did an additional
> test right away. My original test case uses both CPU cores in Dom0. The
> modified test boots Dom0 with "dom0_max_vcpus=1 dom0_vcpus_pin" so that only
> one core is available. With only one core in the Dom0 at least the VM is
> instantiated (meaning the "xl create" command comes back as expected), and the
> Dom0 stays responsive for a little while (in contrast to the original test -
> I was now able to perform "xl list" and did see the VM). Anyway, once I tried
> to "xl console" I only got a fragment:
>
> ```
> ganymed$ doas xl console net
> [ 1.000] cpu_rng: rdrand
> [ 1.000] entropy: ready
> [ 1.000] Copyright (c) 1996, 1997, 1998, 1999,
> ```
>
> At the "1999," the Dom0 became frozen, again.
>
> Kind regards
> Matthias
>

Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
should give even more details.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE
> On 23. Apr 2022, at 14:45, Takahiro Kambe wrote:
>
> Hi,
>
> In message <029c86d6-e0d2-4c98-8798-4bdc39ba0...@mailbox.org>
> on Sat, 23 Apr 2022 10:22:13 +0200,
> "J. Hannken-Illjes" wrote:
>>> On 23. Apr 2022, at 10:15, J. Hannken-Illjes wrote:
>>
>>> Please try the attached diff (with mount option "discard").
>>
>> ... and remove the "#define TRIMDEBUG" from the top of ffs_alloc.c first ...
> Thanks!!
>
> Now, it works fine with "discard" option.

Committed.

--
J. Hannken-Illjes - hann...@mailbox.org
Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE
> On 23. Apr 2022, at 10:15, J. Hannken-Illjes wrote:

> Please try the attached diff (with mount option "discard").

... and remove the "#define TRIMDEBUG" from the top of ffs_alloc.c first ...

--
J. Hannken-Illjes - hann...@mailbox.org
Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE
> On 23. Apr 2022, at 05:17, Takahiro Kambe wrote:
>
> Hi,
>
> In message <15bebcc1-4756-46ad-a424-e5232065b...@mailbox.org>
> on Fri, 22 Apr 2022 19:39:35 +0200,
> "J. Hannken-Illjes" wrote:
>>> #5 0x80e69314 in VOP_FDISCARD (vp=0x879fb253da40,
>>>    pos=, len=)
>>>    at /usr/src/sys/kern/vnode_if.c:845
>>> #6 0x80e69314 in VOP_FDISCARD (vp=0x879fb2e99cc0,
>>>    pos=pos@entry=5843857408, len=len@entry=2048)
>>>    at /usr/src/sys/kern/vnode_if.c:845
>>
>> This one is different from the previous stack trace, two VOP_FDISCARD().
> # cat /etc/fstab
> /dev/dk0 /efi msdos rw,noauto 0 0
> /dev/dk4 / ffs rw,discard 1 1

Ok, you have wedges that introduce another indirection.

Please try the attached diff (with mount option "discard").

--
J. Hannken-Illjes - hann...@mailbox.org

Attachment: fdiscard.diff
Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE
> On 22. Apr 2022, at 16:25, Takahiro Kambe wrote:
>
> Hi,
>
> In message
> on Fri, 22 Apr 2022 09:44:48 +0200,
> "J. Hannken-Illjes" wrote:
>>>> Thanks - I can confirm that a kernel from yesterday doesn't have the
>>>> issue any longer.
>>>
>>> I still have panic() on ThinkPad E495.
>>>
>>> panic: kernel diagnostic assertion "VOP_ISLOCKED(vp) == LK_EXCLUSIVE"
>>> failed: file "/usr/src/sys/miscfs/specfs/spec_vnops.c", line 1252
>>> cpu3: Begin traceback...
>>> vpanic() at netbsd:vpanic+0x183
>>> kern_assert() at netbsd:kern_assert+0x4b
>>> spec_fdiscard() at netbsd:spec_fdiscard+0xaa
>>> VOP_FDISCARD() at netbsd:VOP_FDISCARD+0x3d
>>> ffs_discardcb() at netbsd:ffs_discardcb+0x2e
>>> workqueue_worker() at netbsd:workqueue_worker+0xd7
>>
>>
>> Is the attached diff sufficient to fix your problem?
> Sadly, no luck.
>
> (gdb) where
> #0 0x802261f5 in cpu_reboot (howto=howto@entry=260,
>    bootstr=bootstr@entry=0x0) at /usr/src/sys/arch/amd64/amd64/machdep.c:720
> #1 0x80da5414 in kern_reboot (howto=howto@entry=260,
>    bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:73
> #2 0x80deb32d in vpanic (
>    fmt=0x813938f8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d ",
>    ap=ap@entry=0xb80249389e48) at /usr/src/sys/kern/subr_prf.c:293
> #3 0x80fad18f in kern_assert (
>    fmt=fmt@entry=0x813938f8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d ")
>    at /usr/src/sys/lib/libkern/kern_assert.c:51
> #4 0x80e76595 in spec_fdiscard (v=0xb80249389ee0)
>    at /usr/src/sys/miscfs/specfs/spec_vnops.c:1252
> #5 0x80e69314 in VOP_FDISCARD (vp=0x879fb253da40,
>    pos=, len=)
>    at /usr/src/sys/kern/vnode_if.c:845
> #6 0x80e69314 in VOP_FDISCARD (vp=0x879fb2e99cc0,
>    pos=pos@entry=5843857408, len=len@entry=2048)
>    at /usr/src/sys/kern/vnode_if.c:845

This one is different from the previous stack trace, two VOP_FDISCARD().

Could you print the vnodes and mounts from frame #6 and #5,
( print *vp and print *vp->v_mount ) please?

What is mounted ( /etc/fstab and mount )?

> #7 0x80cdf921 in ffs_discardcb (wk=0x879fb4199840,
>    arg=0x879fb3053f40) at /usr/src/sys/ufs/ffs/ffs_alloc.c:1656
> #8 0x80df4fd6 in workqueue_runlist (list=0x879fb2c34ee8,
>    list=0x879fb2c34ee8, wq=0x879fb2c34e80)
>    at /usr/src/sys/kern/subr_workqueue.c:105
> #9 workqueue_worker (cookie=0x879fb2c34e80)
>    at /usr/src/sys/kern/subr_workqueue.c:135
> #10 0x8020b327 in lwp_trampoline ()
> #11 0x in ?? ()

--
J. Hannken-Illjes - hann...@mailbox.org
Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE
> On 21. Apr 2022, at 16:57, Takahiro Kambe wrote:
>
> In message
> on Mon, 18 Apr 2022 09:56:55 +0200,
> Thomas Klausner wrote:
>>> Already committed by Taylor R Campbell as sequencer.c Rev. 1.79
>>> on 2022/04/16 11:13:10.
>>
>> Thanks - I can confirm that a kernel from yesterday doesn't have the
>> issue any longer.
>
> I still have panic() on ThinkPad E495.
>
> panic: kernel diagnostic assertion "VOP_ISLOCKED(vp) == LK_EXCLUSIVE" failed:
> file "/usr/src/sys/miscfs/specfs/spec_vnops.c", line 1252
> cpu3: Begin traceback...
> vpanic() at netbsd:vpanic+0x183
> kern_assert() at netbsd:kern_assert+0x4b
> spec_fdiscard() at netbsd:spec_fdiscard+0xaa
> VOP_FDISCARD() at netbsd:VOP_FDISCARD+0x3d
> ffs_discardcb() at netbsd:ffs_discardcb+0x2e
> workqueue_worker() at netbsd:workqueue_worker+0xd7

Is the attached diff sufficient to fix your problem?

--
J. Hannken-Illjes - hann...@mailbox.org

Attachment: ffs_alloc.c.diff
Re: reproducible kernel crash with quota
> On 21. Apr 2022, at 00:36, 6b...@6bone.informatik.uni-leipzig.de wrote:
>
> On Wed, 20 Apr 2022, J. Hannken-Illjes wrote:
>
>> Date: Wed, 20 Apr 2022 22:19:30 +0200
>> From: J. Hannken-Illjes
>> To: 6b...@6bone.informatik.uni-leipzig.de
>> Cc: current-users@netbsd.org, Manuel Bouyer
>> Subject: [Extern] Re: reproducible kernel crash with quota
>>> On 20. Apr 2022, at 22:10, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>>
>>> On Tue, 19 Apr 2022, J. Hannken-Illjes wrote:
>>>
>>>> Date: Tue, 19 Apr 2022 11:07:48 +0200
>>>> From: J. Hannken-Illjes
>>>> To: 6b...@6bone.informatik.uni-leipzig.de
>>>> Cc: current-users@netbsd.org, Manuel Bouyer
>>>> Subject: [Extern] Re: reproducible kernel crash with quota
>>>>> On 19. Apr 2022, at 08:38, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>>> Please try again with both diffs applied.
>>>
>>> I tested with both patches. If I just enable userquota it seems to work. If
>>> you also activate groupquota, the kernel crashes:
>>>
>>> output:
>>>
>>> /etc/rc.d/quota restart
>>> Checking quotas:quotacheck: creating quota file //quota.group
>>
>> You have root (/) with quota? What exactly do you have in /etc/fstab?
>
> cat /etc/fstab
> # NetBSD /etc/fstab
> # See /usr/share/examples/fstab/ for more examples.
> NAME=179d5ca2-7f26-476b-b544-823bd1849816 / ffs rw,userquota,groupquota 1 1

I'm confused. With

  "/dev/ld0a / ffs rw,userquota,groupquota 1 1"

in /etc/fstab and both patches applied I get:

  $ /etc/rc.d/quota restart
  Checking quotas: done.

No line "creating quota file ..."

--
J. Hannken-Illjes - hann...@mailbox.org
Re: reproducible kernel crash with quota
> On 20. Apr 2022, at 22:10, 6b...@6bone.informatik.uni-leipzig.de wrote: > > On Tue, 19 Apr 2022, J. Hannken-Illjes wrote: > >> Date: Tue, 19 Apr 2022 11:07:48 +0200 >> From: J. Hannken-Illjes >> To: 6b...@6bone.informatik.uni-leipzig.de >> Cc: current-users@netbsd.org, Manuel Bouyer >> Subject: [Extern] Re: reproducible kernel crash with quota >>> On 19. Apr 2022, at 08:38, 6b...@6bone.informatik.uni-leipzig.de wrote: >>> >>> On Thu, 14 Apr 2022, J. Hannken-Illjes wrote: >>> >>>> Date: Thu, 14 Apr 2022 13:09:02 +0200 >>>> From: J. Hannken-Illjes >>>> To: 6b...@6bone.informatik.uni-leipzig.de >>>> Cc: current-users@netbsd.org, Manuel Bouyer >>>> Subject: [Extern] Re: reproducible kernel crash with quota >>>>> On 12. Apr 2022, at 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote: >>>>> >>>>> Hello, >>>>> >>>>> since I already have some open bugs with reproducible kernel crashes, I'm >>>>> only writing this to the mailing list. >>>>> >>>>> how to reproduce the crash: /etc/rc.d/quota restart >>>>> >>>>> dmesg: >>>>> >>>>> [ 412.047595] panic: kernel diagnostic assertion >>>>> "dq->dq_ump->um_quotas[dq->dq _type] != vp" failed: file >>>>> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978 >>>>> [ 412.047595] cpu8: Begin traceback... 
>>>>> [ 412.047595] vpanic() at netbsd:vpanic+0x156 >>>>> [ 412.057595] kern_assert() at netbsd:kern_assert+0x4b >>>>> [ 412.057595] dqflush() at netbsd:dqflush+0x92 >>>>> [ 412.057595] quota1_handle_cmd_quotaoff() at >>>>> netbsd:quota1_handle_cmd_quotaof f+0x120 >>>>> [ 412.057595] ufs_quotactl() at netbsd:ufs_quotactl+0x3d >>>>> [ 412.057595] VFS_QUOTACTL() at netbsd:VFS_QUOTACTL+0x22 >>>>> [ 412.057595] vfs_quotactl_quotaoff() at >>>>> netbsd:vfs_quotactl_quotaoff+0x1b >>>>> [ 412.057595] do_sys_quotactl() at netbsd:do_sys_quotactl+0xf1 >>>>> [ 412.067595] sys___quotactl() at netbsd:sys___quotactl+0x2e >>>>> [ 412.067595] syscall() at netbsd:syscall+0x196 >>>>> [ 412.067595] --- syscall (number 473) --- >>>>> [ 412.067595] netbsd:syscall+0x196: >>>>> [ 412.067595] cpu8: End traceback... >>>>> >>>>> [ 412.067595] dumping to dev 168,1 (offset=8, size=33425953): >>>>> [ 412.067595] dump >>>>> >>>>> >>>>> (gdb) target kvm netbsd.1.core >>>> >>>> >>>> I'm quite sure you have a /etc/fstab with "userquota,groupquota", yes? >>>> >>>> with gdb: >>>> >>>> frame 4 (dqflush()) >>>> print dq->dq_ump->um_quotas[0] >>>> print dq->dq_ump->um_quotas[1] >>>> >>>> gives the same vnode address for both fields, yes? >>>> >>>> If this is the case the attached diff should help, since 2012-01-30 >>>> group quota got enabled on the user quota file. >>>> >>>> As a workaround you could try to name the quota files in /etc/fstab >>>> like "groupquota=XXX/quota.group". >>> >>> You are right. I use groupquota and userquota in fstab. I tested the patch. >>> With patch there is no crash. But the /etc/rc.d/quota restart leads to the >>> blocking of the file system. You can only turn off the server. This also >>> happens when I only use userquota in the fstab. >> >> Sorry, forgot the second diff (now attached) that prevents looping >> when taking the quota off on a modified file system. >> >> Please try again with both diffs applied. > > I tested with both patches. 
If I just enable querquota it seems to work. If > you also activate groupquota, the kernel crashes: > > output: > > /etc/rc.d/quota restart > Checking quotas:quotacheck: creating quota file //quota.group You have root (/) with quota? What exactly do you have in /etc/fstab? > done. > > -> crash Are "dq->dq_ump->um_quotas[0]" and "dq->dq_ump->um_quotas[1]]" now different? > [ 448.325252] panic: kernel diagnostic assertion > "dq->dq_ump->um_quotas[dq->dq_type] != vp" failed: file > "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978 > [ 448.325252] cpu1: Begin traceback...
Re: reproducible kernel crash with quota
> On 19. Apr 2022, at 08:38, 6b...@6bone.informatik.uni-leipzig.de wrote: > > On Thu, 14 Apr 2022, J. Hannken-Illjes wrote: > >> Date: Thu, 14 Apr 2022 13:09:02 +0200 >> From: J. Hannken-Illjes >> To: 6b...@6bone.informatik.uni-leipzig.de >> Cc: current-users@netbsd.org, Manuel Bouyer >> Subject: [Extern] Re: reproducible kernel crash with quota >>> On 12. Apr 2022, at 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote: >>> >>> Hello, >>> >>> since I already have some open bugs with reproducible kernel crashes, I'm >>> only writing this to the mailing list. >>> >>> how to reproduce the crash: /etc/rc.d/quota restart >>> >>> dmesg: >>> >>> [ 412.047595] panic: kernel diagnostic assertion >>> "dq->dq_ump->um_quotas[dq->dq _type] != vp" failed: file >>> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978 >>> [ 412.047595] cpu8: Begin traceback... >>> [ 412.047595] vpanic() at netbsd:vpanic+0x156 >>> [ 412.057595] kern_assert() at netbsd:kern_assert+0x4b >>> [ 412.057595] dqflush() at netbsd:dqflush+0x92 >>> [ 412.057595] quota1_handle_cmd_quotaoff() at >>> netbsd:quota1_handle_cmd_quotaof f+0x120 >>> [ 412.057595] ufs_quotactl() at netbsd:ufs_quotactl+0x3d >>> [ 412.057595] VFS_QUOTACTL() at netbsd:VFS_QUOTACTL+0x22 >>> [ 412.057595] vfs_quotactl_quotaoff() at netbsd:vfs_quotactl_quotaoff+0x1b >>> [ 412.057595] do_sys_quotactl() at netbsd:do_sys_quotactl+0xf1 >>> [ 412.067595] sys___quotactl() at netbsd:sys___quotactl+0x2e >>> [ 412.067595] syscall() at netbsd:syscall+0x196 >>> [ 412.067595] --- syscall (number 473) --- >>> [ 412.067595] netbsd:syscall+0x196: >>> [ 412.067595] cpu8: End traceback... >>> >>> [ 412.067595] dumping to dev 168,1 (offset=8, size=33425953): >>> [ 412.067595] dump >>> >>> >>> (gdb) target kvm netbsd.1.core >> >> >> I'm quite sure you have a /etc/fstab with "userquota,groupquota", yes? >> >> with gdb: >> >> frame 4 (dqflush()) >> print dq->dq_ump->um_quotas[0] >> print dq->dq_ump->um_quotas[1] >> >> gives the same vnode address for both fields, yes? 
>> >> If this is the case the attached diff should help, since 2012-01-30 >> group quota got enabled on the user quota file. >> >> As a workaround you could try to name the quota files in /etc/fstab >> like "groupquota=XXX/quota.group". > > You are right. I use groupquota and userquota in fstab. I tested the patch. > With patch there is no crash. But the /etc/rc.d/quota restart leads to the > blocking of the file system. You can only turn off the server. This also > happens when I only use userquota in the fstab. Sorry, forgot the second diff (now attached) that prevents looping when taking the quota off on a modified file system. Please try again with both diffs applied. > Thank you for your efforts > > Regards > Uwe > > >> >>> >>> Maybe someone can fix the problem. >>> >>> >>> Thank you for your efforts >>> >>> >>> Regards >>> Uwe >> >> -- >> J. Hannken-Illjes - hann...@mailbox.org >> -- J. Hannken-Illjes - hann...@mailbox.org 003_quota_flag.diff Description: Binary data signature.asc Description: Message signed with OpenPGP
Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE
> On 16. Apr 2022, at 17:27, Tobias Nygren wrote:
>
> On Sat, 16 Apr 2022 16:51:31 +0200
> Thomas Klausner wrote:
>
>> panic: kernel diagnostic assertion "VOP_ISLOCKED(vp) == LK_EXCLUSIVE"
>> failed: file "/usr/src/sys/miscfs/specfs/spec_vnops.c", line 1555
>> cpu1: Begin traceback...
>> vpanic()
>> kern_assert()
>> spec_close() at netbsd:spec_close+0x2fc
>> VOP_CLOSE() at netbsd:vop_close+0x42
>> sequenceropen() at netbsd:sequenceropen+0x359
>
> "cat /dev/sequencer" as a regular user is enough to trigger this. In
> the midiseq_open() error path it is trying to VOP_CLOSE without the
> vnode lock held. Maybe this patch helps. (Someone with filesystem
> clue please sanity check this.)
>
> --- sys/dev/sequencer.c 31 Mar 2022 19:30:15 - 1.76
> +++ sys/dev/sequencer.c 16 Apr 2022 15:23:54 -
> @@ -1452,8 +1452,9 @@ midiseq_open(int unit, int flags)
>       if ((mi.props & MIDI_PROP_CAN_INPUT) == 0)
>               flags &= ~FREAD;
>       if ((flags & (FREAD|FWRITE)) == 0) {
> +             vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>               VOP_CLOSE(vp, oflags, kauth_cred_get());
> -             vrele(vp);
> +             vput(vp);
>               return NULL;
>       }

Already committed by Taylor R Campbell as sequencer.c Rev. 1.79
on 2022/04/16 11:13:10.

--
J. Hannken-Illjes - hann...@mailbox.org

signature.asc
Description: Message signed with OpenPGP
Re: reproducible kernel crash with quota
> On 12. Apr 2022, at 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote: > > Hello, > > since I already have some open bugs with reproducible kernel crashes, I'm > only writing this to the mailing list. > > how to reproduce the crash: /etc/rc.d/quota restart > > dmesg: > > [ 412.047595] panic: kernel diagnostic assertion > "dq->dq_ump->um_quotas[dq->dq _type] != vp" failed: file > "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978 > [ 412.047595] cpu8: Begin traceback... > [ 412.047595] vpanic() at netbsd:vpanic+0x156 > [ 412.057595] kern_assert() at netbsd:kern_assert+0x4b > [ 412.057595] dqflush() at netbsd:dqflush+0x92 > [ 412.057595] quota1_handle_cmd_quotaoff() at > netbsd:quota1_handle_cmd_quotaof f+0x120 > [ 412.057595] ufs_quotactl() at netbsd:ufs_quotactl+0x3d > [ 412.057595] VFS_QUOTACTL() at netbsd:VFS_QUOTACTL+0x22 > [ 412.057595] vfs_quotactl_quotaoff() at netbsd:vfs_quotactl_quotaoff+0x1b > [ 412.057595] do_sys_quotactl() at netbsd:do_sys_quotactl+0xf1 > [ 412.067595] sys___quotactl() at netbsd:sys___quotactl+0x2e > [ 412.067595] syscall() at netbsd:syscall+0x196 > [ 412.067595] --- syscall (number 473) --- > [ 412.067595] netbsd:syscall+0x196: > [ 412.067595] cpu8: End traceback... > > [ 412.067595] dumping to dev 168,1 (offset=8, size=33425953): > [ 412.067595] dump > > > (gdb) target kvm netbsd.1.core I'm quite sure you have a /etc/fstab with "userquota,groupquota", yes? with gdb: frame 4 (dqflush()) print dq->dq_ump->um_quotas[0] print dq->dq_ump->um_quotas[1] gives the same vnode address for both fields, yes? If this is the case the attached diff should help, since 2012-01-30 group quota got enabled on the user quota file. As a workaround you could try to name the quota files in /etc/fstab like "groupquota=XXX/quota.group". > > Maybe someone can fix the problem. > > > Thank you for your efforts > > > Regards > Uwe -- J. 
Hannken-Illjes - hann...@mailbox.org quota_oldfiles.c.diff Description: Binary data signature.asc Description: Message signed with OpenPGP
Re: Unprivileged build can't build custom kernels
> On 31. Dec 2021, at 12:37, John D. Baker wrote: > > The recent changes to build "netbsd-${CONF}.debug" seems not to work > for unprivileged builds when building custom kernels as it wants to > install the file owned by "root": > > [...] > # link DAVID/netbsd > /r0/build/current/tools/amd64/bin/sparc--netbsdelf-ld -Map netbsd.map --cref > -n -T netbsd.ldscript -Ttext F0004000 -e start -X -X -o netbsd > ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o > NetBSD 9.99.93 (DAVID) #376: Fri Dec 31 02:44:31 CST 2021 > textdata bss dec hex filename > 4808082 118584 147752 5074418 4d6df2 netbsd > + mv -f netbsd netbsd.gdb > + /r0/build/current/tools/amd64/bin/sparc--netbsdelf-objcopy > --only-keep-debug netbsd.gdb netbsd-DAVID.debug > + /r0/build/current/tools/amd64/bin/sparc--netbsdelf-objcopy --strip-debug -p > -R .gnu_debuglink --add-gnu-debuglink=netbsd-DAVID.debug netbsd.gdb netbsd > + chmod 755 netbsd netbsd.gdb netbsd-DAVID.debug > --- /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug --- > # install /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug > /r0/build/current/tools/amd64/bin/sparc--netbsdelf-install -c -p -r -o root > -g bin -m 444 netbsd-DAVID.debug > /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug > sparc--netbsdelf-install: > /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug.inst.fO9ANt: > chown/chgrp: Operation not permitted > > *** Failed target: > /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug > *** Failed commands: >${_MKTARGET_INSTALL} >=> @echo '# ' "install " > /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug >${INSTALL_FILE} -o root -g bin -m 444 ${.ALLSRC} ${.TARGET} >=> /r0/build/current/tools/amd64/bin/sparc--netbsdelf-install -c -p > -r -o root -g bin -m 444 netbsd-DAVID.debug > /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug > *** [/r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug] Error > 
code 1 > > nbmake: stopped in /r0/build/current/obj/sparc/sys/arch/sparc/compile/DAVID > 1 error > > nbmake: stopped in /r0/build/current/obj/sparc/sys/arch/sparc/compile/DAVID > > ERROR: Failed to make debuginstall in > "/r0/build/current/obj/sparc/sys/arch/sparc/compile/DAVID" > *** BUILD ABORTED *** For me the attached diff works. It skips the install outside ${NETBSDSRCDIR}. -- J. Hannken-Illjes - hann...@mailbox.org Makefile.kern.inc.diff Description: Binary data signature.asc Description: Message signed with OpenPGP
Re: null mounts seem to lose directories?
> On 22. Aug 2021, at 09:26, nia wrote: > > I have various null mounts on top of a tmpfs: > > $ df -h > ... > tmpfs 87G 1.0G86G 1% > /sandbox/nb9-i386-trunk/chroot/1 > /sandbox/nb9-i386-trunk/data/bulklog 237G79G 158G 33% > /sandbox/nb9-i386-trunk/chroot/1/data/bulklog > ... > > Directories are not being synchronized properly across the null mount: > > procyon$ ls /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/ > build.log checksum.log configure.log depends.log pre-clean.log work.log > procyon$ ls /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/ > ls: /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/: No such file or > directory > > The source of the null mount is a ZFS dataset: > > # zfs list > ... > tank/sandbox/nb9-i386-trunk/data 124G 158G 79.2G > /sandbox/nb9-i386-trunk/data > ... Using the attached script I see no problems. What are you doing between the "mount -t tmpfs", "mount -t null" and this "ls"? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig SCRIPT: mkdir -p /sandbox/nb9-i386-trunk/data zpool create -m /sandbox/nb9-i386-trunk/data tank /dev/md0 mkdir -p /sandbox/nb9-i386-trunk/chroot/1 mount -t tmpfs tmpfs /sandbox/nb9-i386-trunk/chroot/1 mkdir -p /sandbox/nb9-i386-trunk/data/bulklog mkdir -p /sandbox/nb9-i386-trunk/chroot/1/data/bulklog mount -t null /sandbox/nb9-i386-trunk/data/bulklog \ /sandbox/nb9-i386-trunk/chroot/1/data/bulklog df -h | egrep '(^File|/sand.*chr)' zfs list #mkdir -p /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/ mkdir -p /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/ touch /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/build.log.t touch /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/build.log.z ls /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/ ls /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/ OUTPUT: Filesystem Size Used Avail %Cap Mounted on tmpfs 67G 4.0K67G 0% /sandbox/nb9-i386-trunk/chroot/1 /sandbox/nb9-i386-trunk/data/bulklog 
1.8G23K 1.8G 0% /sandbox/nb9-i386-trunk/chroot/1/data/bulklog NAME USED AVAIL REFER MOUNTPOINT tank 304K 1.81G23K /sandbox/nb9-i386-trunk/data build.log.t build.log.z build.log.t build.log.z signature.asc Description: Message signed with OpenPGP
Re: 9.99.86 HEAD
> On 1. Jul 2021, at 21:04, David Holland wrote:
>
> On Thu, Jul 01, 2021 at 07:54:33PM +0200, J. Hannken-Illjes wrote:
>> lookup_fastforward -> lookup_parsepath -> VOP_PARSEPATH -> ... ->
>> fstrans_start
>
> Bleh. I had a feeling we were going to end up regretting that
> fastforward code. :-|
>
>> According to vnode_if.src VOP_PARSEPATH(dvp...) should take a locked vnode
>> but here this lock is missing. So either
>>
>> - make sure the vnode is locked so fstrans_start will no longer block.
>>
>> or
>>
>> - add FSTRANS=NO to vop_parsepath, file kern/vnode_if.src and allow unlocked
>> vnodes:
>>
>> vop_parsepath {
>> + FSTRANS=NO
>>   IN struct vnode *dvp;
>>
>> David?
>
> I thought the vnode was locked readonly in the fastforward path. Did I
> misread? Or is that not good enough?

Nope, the fastforward path takes namecache locks only.

> Anyway, I think it's probably ok to change vop_parsepath to not
> require locked vnodes at all. The only parsepath operation that does
> anything other than string ops is rumpfs's, and it takes etfs_lock to
> look in some tables that etfs_lock covers. Unless that's going to
> interact badly with fstrans without the vnode lock covering it (seems
> unlikely, but IDK) there shouldn't be a problem.

This is ok, the vnode is referenced and comparing it to rootvnode is ok.

> However, except in the fastforward code the vnode will be locked. So I
> think it should be "= = =" in vnode_if.src. If you also need to add
> FSTRANS=NO, that should be fine too.

Setting "= = =" is ok, but it is only a comment. You also need "FSTRANS=NO"
to prevent VOP_PARSEPATH() from taking fstrans and deadlocking.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

signature.asc
Description: Message signed with OpenPGP
Re: 9.99.86 HEAD
> On 1. Jul 2021, at 18:24, Martin Husemann wrote:
>
> I did not trust macppc / lockdebug so reproduced it on evbarm.
>
> Unfortunately nearly identical (not making any sense to me) output again...

I'm quite sure one thread does something like

lookup_fastforward -> lookup_parsepath -> VOP_PARSEPATH -> ... -> fstrans_start

where dvp->v_mount is currently unmounting and therefore suspended.
If lookup_fastforward holds a lock on vi_nc_lock we have a deadlock.

According to vnode_if.src VOP_PARSEPATH(dvp...) should take a locked vnode
but here this lock is missing. So either

- make sure the vnode is locked so fstrans_start will no longer block.

or

- add FSTRANS=NO to vop_parsepath, file kern/vnode_if.src and allow unlocked
  vnodes:

vop_parsepath {
+ FSTRANS=NO
  IN struct vnode *dvp;

David?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

signature.asc
Description: Message signed with OpenPGP
Re: dump/restore out of range inode
> On 6. Jun 2021, at 10:10, Patrick Welche wrote:
>
> On Sat, Jun 05, 2021 at 06:45:24PM +0200, J. Hannken-Illjes wrote:
>> Patrick,
>>
>> please try the attached diff so the "spcl.c_addr" test
>> no longer runs off the spcl record.
>>
>> "blks" is used for multi-tape checkpointing and examining
>> TS_INODE/TS_ADDR records should be sufficient as they are
>> the only records that support holes in data.
>
> Thanks! With your patch, the dump | restore has been happily
> running for about 12 hours now.

Ok, will commit and request pullup next week.

> In your previous email you mention:
>
>> This trace makes no sense, bitmaps (CLRI and BITS) don't have holes
>> and therefore ignore the "c_addr" array. I have no idea how dumping
>> a bitmap ends in the hole processing of flushtape().
>
> Is it worth investigating further while I have the reproducer?

No, this is an error that manifests only on file systems with many
inodes and therefore did not show up before.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

signature.asc
Description: Message signed with OpenPGP
Re: dump/restore out of range inode
Patrick,

please try the attached diff so the "spcl.c_addr" test
no longer runs off the spcl record.

"blks" is used for multi-tape checkpointing and examining
TS_INODE/TS_ADDR records should be sufficient as they are
the only records that support holes in data.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

tape.c.diff
Description: Binary data

signature.asc
Description: Message signed with OpenPGP
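The attached tape.c.diff is not reproduced in the archive, so the following is only a sketch of the idea described above, not the committed change. It counts non-hole blocks for TS_INODE/TS_ADDR records only. All sk_* names are hypothetical stand-ins, the record type values are assumed, and the clamp to the array size is an extra safety net of this sketch rather than something the mail says the diff does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the dumprestore.h definitions (assumed values). */
#define SK_TS_INODE 2
#define SK_TS_ADDR  4
#define SK_TS_CLRI  6
#define SK_NADDR    512	/* size of c_addr in this sketch */

struct sk_spcl {
	int32_t c_type;			/* record type */
	int32_t c_count;		/* number of valid c_addr entries */
	char    c_addr[SK_NADDR];	/* 1 => data present, 0 => hole */
};

/*
 * Count the data blocks that follow a header record.
 * Only TS_INODE/TS_ADDR records can describe holes; for every other
 * record type (bitmaps included) c_addr is not consulted at all.
 */
static int64_t
sk_count_blocks(const struct sk_spcl *sp)
{
	int64_t blks = 0;
	int32_t i, n;

	assert(sp != NULL);
	if (sp->c_type != SK_TS_INODE && sp->c_type != SK_TS_ADDR)
		return 0;

	/* Never index past the end of c_addr, whatever c_count claims. */
	n = sp->c_count;
	if (n < 0)
		n = 0;
	if (n > SK_NADDR)
		n = SK_NADDR;

	for (i = 0; i < n; i++)
		if (sp->c_addr[i] != 0)
			blks++;
	return blks;
}
```

With a bitmap record whose c_count is far larger than the array, this sketch returns 0 without touching c_addr.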
Re: dump/restore out of range inode
> On 5. Jun 2021, at 12:31, Patrick Welche wrote: > > On Sat, Jun 05, 2021 at 10:03:21AM -, Michael van Elst wrote: >> pr...@cam.ac.uk (Patrick Welche) writes: >> >>> How can gdb not see a spcl anywhere? >> >> /usr/include/protocols/dumprestore.h:#define spcl u_spcl.s_spcl >> >> spcl is just a define that got resolved by the compiler. > > ach... here it is(gdb) print u_spcl.s_spcl > > $2 = {c_type = 6, c_old_date = 0, c_old_ddate = 0, c_volume = 1, > c_old_tapea = 0, c_inumber = 397083647, c_magic = 424935705, > c_checksum = 1906085926, __c_ino = {__uc_dinode = {di_mode = 0, > di_nlink = 0, di_oldids = {0, 0}, di_size = 0, di_atime = 0, > di_atimensec = 0, di_mtime = 0, di_mtimensec = 0, di_ctime = 0, > di_ctimensec = 0, di_db = {0 }, di_ib = {0, 0, 0}, > di_flags = 0, di_blocks = 0, di_gen = 0, di_uid = 0, di_gid = 0, > di_modrev = 0}, __uc_ino = {__uc_mode = 0, __uc_spare1 = {0, 0, 0}, > __uc_size = 0, __uc_old_atime = 0, __uc_atimensec = 0, > __uc_old_mtime = 0, __uc_mtimensec = 0, __uc_spare2 = {0, 0}, > __uc_rdev = 0, __uc_birthtimensec = 0, __uc_birthtime = 0, > __uc_atime = 0, __uc_mtime = 0, __uc_spare4 = {0, 0, 0, 0, 0, 0, 0}, > __uc_file_flags = 0, __uc_spare5 = {0, 0}, __uc_uid = 0, __uc_gid = 0, > __uc_spare6 = {0, 0}}}, c_count = 48473, > c_addr = '\000' , > c_label = "none", '\000' , c_level = 0, > c_filesys = "/store/backup", '\000' , > c_dev = "/dev/rdk18", '\000' , > c_host = "quantz", '\000' , c_flags = 2, > c_old_firstrec = 0, c_date = 1622887657, c_ddate = 0, c_tapea = 10, > c_firstrec = 0, c_spare = {0 }} > (gdb) bt > #0 flushtape () at /usr/src/sbin/dump/tape.c:333 > #1 0x0020763e in writerec (dp=dp@entry=0x7f7ff3a01380 "", >isspcl=isspcl@entry=0) at /usr/src/sbin/dump/tape.c:168 > #2 0x00208e49 in dumpmap (map=, type=type@entry=6, >ino=ino@entry=397083647) at /usr/src/sbin/dump/traverse.c:716 > #3 0x0020b355 in main (argc=1, argv=0x7f7fe7e8) >at /usr/src/sbin/dump/main.c:646 > (gdb) list > 328 } > 329 > 330 blks = 0; > 331 if 
(iswap32(spcl.c_type) != TS_END) { > 332 for (i = 0; i < iswap32(spcl.c_count); i++) > 333 if (spcl.c_addr[i] != 0) > 334 blks++; > 335 } > 336 slp->count = lastspclrec + blks + 1 - iswap64(spcl.c_tapea); > 337 slp->tapea = iswap64(spcl.c_tapea); > (gdb) print i > $6 = > (gdb) print u_spcl.s_spcl.c_count > $7 = 48473 > (gdb) whatis u_spcl.s_spcl.c_addr > type = char [512] > > so guess optimized_out i >> 512 > > c_type==6 = TS_CLRI map of inodes deleted since last dump > > (a bit odd: > (gdb) print needswap > $11 = 0 > (gdb) print iswap32(u_spcl.s_spcl.c_count) > $10 = 1505558528 > ) > > Still puzzled... This trace makes no sense, bitmaps (CLRI and BITS) don't have holes and therefore ignore the "c_addr" array. I have no idea how dumping a bitmap ends in the hole processing of flushtape(). -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig signature.asc Description: Message signed with OpenPGP
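On the "a bit odd" gdb output above: 1505558528 is exactly what a plain 32-bit byte reversal of 48473 gives (0x0000bd59 becomes 0x59bd0000). Why gdb shows the swapped value while needswap is 0 is not explained in the thread. A minimal sketch of the arithmetic, with sk_bswap32 as a hypothetical helper and not the dump(8) iswap32 itself:

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t
sk_bswap32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) <<  8) |
	       ((v & 0x00ff0000u) >>  8) |
	       ((v & 0xff000000u) >> 24);
}

/* The two values from the gdb session are byte-swapped images of each other. */
static void
sk_selfcheck(void)
{
	assert(sk_bswap32(48473u) == 1505558528u);
	assert(sk_bswap32(1505558528u) == 48473u);
}
```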
Re: zfs howto
> On 14. Feb 2021, at 02:55, Brad Spencer wrote: > > Chavdar Ivanov writes: > > [snip] > >>> I am not sure of the complete context of the statement, but I do this >>> all of the time with normal NetBSD NFS against a ZFS fileset. >>> >>> build% cat /etc/exports >>> /usr/installed_src/PKGSRC_2018Q4 -alldirs -maproot=root >>> anotherbuild.system.eldar.org >>> >>> build% zfs list /usr/installed_src/PKGSRC_2018Q4 >>> NAME USED AVAIL REFER MOUNTPOINT >>> tank/installed_src/PKGSRC_2018Q4 414M 250G 414M >>> /usr/installed_src/PKGSRC_2018Q4 >>> >>> >>> These are DOMUs running NetBSD 9.0_STABLE from around September. I have >>> not tried this with -current, but there are no crashes for me with 9.x. > > [snip] > >> >> I got it --- >> >> With the following entry in -etc-exports: >> >> /tank/t1 -maproot=0:10 -network 192.168.0/24 >> >> the NFS server crashes when /tank/t1 is zfs system. >> >> With the following one: >> >> /tank/t1 -maproot=root -network 192.168.0/24 >> >> it works fine. >> >> Mind you, '-maproot=0:10' is the first example from 'man exports' ... The trigger is '-maproot' with group(s), first bug is mountd leaving 'cr_gid' as -2 and setting the first group list member to 10 in this case. Second bug is ZFS setting illegal group id -2 aka 4294967294 to GID_NOBODY with id -2. Later this illegal id leads to null pointer dereference in zfs_log_create() at zfs_log.c:297 "lr->lr_gid = fuidp->z_fuid_group" where fuidp is NULL. With the attached diff the ZFS bug gets fixed and your export works. > Glad to see that it isn't totally broken. I am by no means an expert in > the ZFS code, and I am not in a position to take a lot of time looking > at it right now, but if the trace back in the PR is correct, it makes it > almost totally though the mkdir call and crashes in the log create step > after the directory node is created. I am trying not to speculate too > much here, but the code may fail to handle the group in the exports > line. 
> > > > > > > -- > Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig zfs_context.h.diff Description: Binary data signature.asc Description: Message signed with OpenPGP
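The "-2 aka 4294967294" above is just the unsigned view of the same 32 bits. A small sketch of that conversion, assuming a 32-bit unsigned group id type; sk_gid_t and both helpers are hypothetical names and not the ZFS or mountd code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t sk_gid_t;	/* stand-in for gid_t, assumed 32-bit unsigned */

/* Store a signed value in a gid-sized field, as happens when -2 is assigned. */
static sk_gid_t
sk_gid_from_int(int32_t raw)
{
	return (sk_gid_t)raw;
}

/* True if a gid is the "-2" placeholder the reply above talks about. */
static int
sk_gid_is_minus2(sk_gid_t gid)
{
	return gid == (sk_gid_t)-2;
}

static void
sk_gid_selfcheck(void)
{
	assert(sk_gid_from_int(-2) == 4294967294u);
	assert(sk_gid_is_minus2(4294967294u));
	assert(!sk_gid_is_minus2(10));
}
```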
Re: Automated report: NetBSD-current/i386 test failure
> On 16. Jun 2020, at 12:42, NetBSD Test Fixture wrote: > > This is an automatically generated notice of new failures of the > NetBSD test suite. > > The newly failing test cases are: > >fs/vfs/t_full:nfs_fillfs [snip] >2020.06.14.23.38.25 kamil src/sys/rump/include/rump/rump.h,v 1.72 This commit seems to be the cause ... -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig signature.asc Description: Message signed with OpenPGP
Re: panic on zpool create
> On 6. Jan 2020, at 15:04, David Brownlee wrote: > > I've just tried to create a zfs pool and had a panic (tried twice, was > gifted a kernel core each time). This is on latest NetBSD-9.0_RC1 from > nyftp (Thu Jan 2 10:02:26 UTC 2020) > > The command were "zpool create angus_media wd0" and "zpool create -f > angus_media wd0 wd1" > > Moved disks to another machine (On which I'd used zfs before), on > which the latter command completed fine (modulus wd1 and wd2 as > different device numbers). > > Disks were 6TB with any labels/gpt blanked. > > Bit of a puzzler... > > crash reports: > > _KERNEL_OPT_NARCNET() at 0 > ?() at a9826c8f4000 > vpanic() at vpanic+0x169 > snprintf() at snprintf > startlwp() at startlwp > calltrap() at calltrap+0x11 > fstrans_start() at fstrans_start+0x64 > VOP_LOCK() at VOP_LOCK+0x52 > vn_lock() at vn_lock+0x11 > secmodel_extensions_system_cb() at secmodel_extensions_system_cb+0x70 > kauth_authorize_action() at kauth_authorize_action+0xaa > kauth_authorize_system() at kauth_authorize_system+0x28 > zfs_mount() at zfs_mount+0xcf > VFS_MOUNT() at VFS_MOUNT+0x4d > mount_domount() at mount_domount+0xdf > do_sys_mount() at do_sys_mount+0x580 > sys___mount50() at sys___mount50+0x33 > syscall() at syscall+0x157 > --- syscall (number 410) --- > 7d6364687aba: For some reason locking the directory we want to mount on crashes. Anything special with the root on this machine? Does the directory (/angus_media I suppose) exist? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Tar extract behaviour changed
-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig > On 22. Oct 2019, at 07:26, Martin Husemann wrote: > > The current state silently breaks existing valid setups ("valid" of course > in my view, as I personally ran into one that I created myself). It breaks chrooted services, I got non-working "unbound" and "nsd". Suppose this will hurt a bunch of installations when they go from -8 to -9. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig signature.asc Description: Message signed with OpenPGP
Tar extract behaviour changed
Somewhere between NetBSD-8 and NetBSD-9 "tar" changed its behaviour
when it has to extract a directory and the path exists as a symlink.

The attached script on -8 gives:

NetBSD 8.0_STABLE

== Initial:
total 8
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 realtarget
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 target

== Change to symlink:
total 4
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 realtarget
lrwxr-xr-x  1 hannken  staff   10 Oct 21 11:47 target -> realtarget

== After extract:
total 4
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 realtarget
lrwxr-xr-x  1 hannken  staff   10 Oct 21 11:47 target -> realtarget

On -9 it gives:

NetBSD 9.0_BETA

== Initial:
total 4
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 realtarget
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 target

== Change to symlink:
total 2
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 realtarget
lrwxr-xr-x  1 root  wheel   10 Oct 21 11:48 target -> realtarget

== After extract:
total 4
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 realtarget
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 target

Here "target" was changed from symlink to directory.

On NetBSD-9 extracting the "base" set overrides the symlink "/etc/unbound"
with a directory and therefore unbound fails to start.

Is this a bug in "tar" or is there a switch to get the old behaviour back?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

#! /bin/sh

uname -sr
mkdir junk
mkdir junk/target junk/realtarget
printf "\n== Initial:\n" ; ls -l junk
tar -c -f junk.tar junk
rmdir junk/target
ln -s realtarget junk/target
printf "\n== Change to symlink:\n" ; ls -l junk
tar -x -f junk.tar
printf "\n== After extract:\n" ; ls -l junk
rm -rf junk junk.tar
Re: i386 9.99.17 build fails for NET4501 kernel
Any chance we can build x86 kernels without DIAGNOSTIC again? Does it need a PR? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig > On 15. Oct 2019, at 17:56, John D. Baker wrote: > > On Tue, 15 Oct 2019, Ryo ONODERA wrote: > >> If the problem is the compiler bug, the patch like the following >> may be effective. > >> [snip] > > I applied a similar change for i386 and with it the stock NET4501 > config (w/"options DIAGNOSTIC" commented out) builds successfully. > > +Index: sys/arch/i386/conf/Makefile.i386 > +=== > +RCS file: /cvsroot/src/sys/arch/i386/conf/Makefile.i386,v > +retrieving revision 1.194 > +diff -u -p -r1.194 Makefile.i386 > +--- sys/arch/i386/conf/Makefile.i38622 Sep 2018 12:24:02 - 1.194 > sys/arch/i386/conf/Makefile.i38615 Oct 2019 15:52:05 - > +@@ -44,6 +44,7 @@ CFLAGS+= -mno-mmx -mno-sse -mno-avx > + CFLAGS+= -mindirect-branch=thunk > + CFLAGS+= -mindirect-branch-register > + .endif > ++COPTS.vm_machdep.c+= -Wno-error=clobbered > + > + ## > + ## (3) libkern and compat > > > > > -- > |/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X > |\ / jdbaker[snail]consolidated[flyspeck]net OpenBSDFreeBSD > | X No HTML/proprietary data in email. BSD just sits there and works! > |/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645 signature.asc Description: Message signed with OpenPGP
Re: VFS panic
> On 21. Feb 2019, at 00:18, Robert Swindells wrote: > > > I'm getting a panic at startup on an evbearmv7hf-el system: > > ... > Starting sshd. > Starting inetd. > Starting cron. > Wed Feb 20 21:23:46 GMT 2019 > panic: kernel diagnostic assertion "mp != dead_rootmount" failed: file > "../../../../kern/vfs_trans.c", line 680 > cpu1: Begin traceback... > 0x9ac8dd54: netbsd:db_panic+0x14 > 0x9ac8dd6c: netbsd:vpanic+0x194 > 0x9ac8dd84: netbsd:__aeabi_uldivmod > 0x9ac8ddbc: netbsd:vfs_suspend+0x1b8 > 0x9ac8dddc: netbsd:vrevoke_suspend_next+0x3c > 0x9ac8de14: netbsd:vrevoke+0xc4 > 0x9ac8de24: netbsd:genfs_revoke+0x20 > 0x9ac8de4c: netbsd:VOP_REVOKE+0x40 > 0x9ac8df14: netbsd:dorevoke+0x94 > 0x9ac8df34: netbsd:sys_revoke+0x44 > 0x9ac8dfac: netbsd:syscall+0x12c > cpu1: End traceback... > Undefined instruction 0xe7ff in kernel at 0x80023534 (LR 0x80265358 SP > 0x9ac > 8dd58) > Stopped in pid 621.1 (getty) at netbsd:cpu_Debugger:und 0xe7ff > db{1}> > > This is a -current kernel from sources updated about an hour ago, > userland is a couple of days old. Please try again with sys/kern/vfs_trans.c Rev. 1.55 -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Kernel crash trying to use union mount
> On 18. Jan 2019, at 14:13, Tom Ivar Helbekkmo wrote:
>
> I just had a really weird crash on a NetBSD/amd64-current system,
> running a kernel 8.99.30 from January 2nd. Here's what happened:
>
> I was going to experiment with a rather large set of changes to the
> local copy of the source tree, which I'd want to revert afterwards, so I
> created a directory on another file system, and mounted it on top of
> /usr/src with mount_union. I then copied a 10MiB diff into /usr/src/.
> That went well - the file was visible in /usr/src/, and I observed that
> it was correctly stored in the auxiliary directory, as expected.
>
> Then I tried reading the file from /usr/src/, and the system immediately
> crashed, and dumped core, with the panic:
>
> kernel diagnostic assertion "fli->fli_trans_cnt > 0" failed: file
> "/usr/src/sys/kern/vfs_trans.c", line 451

The VOP_UNLOCK() doesn't match the corresponding vn_lock().

> fstrans_done() at fstrans_done+0x126
> VOP_UNLOCK() at VOP_UNLOCK+0x5b
> vput() at vput+0x11
> union_lookup1() at union_lookup1+0xfe

This is

	while (dvp != udvp && (dvp->v_type == VDIR) &&
	    (mp = dvp->v_mountedhere)) {
		if (vfs_busy(mp))
			continue;
		vput(dvp);

which looks wrong. Please show your mounted file systems.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

signature.asc
Description: Message signed with OpenPGP
Re: panic when removing a file in current
On Thu, Jul 19, 2018 at 01:08:22PM +0200, Johnny Billquist wrote: > Hmm. That means I need to update user land, which can be a bit scary since it > can make a rollback really hard. > And there is also a chicken and egg thing here. Installing a new user land > can potentially mean removing files, which will trigger the panic. > > Is it really motivated with that panic? The system is running without issues > on that same file system and NetBSD 7. You could backport this change to -7 fsck_ffs, the patch (attached) is small. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany) Index: pass1.c === RCS file: /cvsroot/src/sbin/fsck_ffs/pass1.c,v retrieving revision 1.57 retrieving revision 1.58 diff -p -u -r1.57 -r1.58 --- pass1.c 8 Feb 2017 16:11:40 - 1.57 +++ pass1.c 13 Feb 2018 11:20:08 - 1.58 @@ -253,8 +253,9 @@ checkinode(ino_t inumber, struct inodesc (memcmp(dp->dp1.di_db, ufs1_zino.di_db, UFS_NDADDR * sizeof(int32_t)) || memcmp(dp->dp1.di_ib, ufs1_zino.di_ib, - UFS_NIADDR * sizeof(int32_t || - mode || size) { + UFS_NIADDR * sizeof(int32_t + || + mode || size || DIP(dp, blocks)) { pfatal("PARTIALLY ALLOCATED INODE I=%llu", (unsigned long long)inumber); if (reply("CLEAR") == 1) {
Re: panic when removing a file in current
> On 19. Jul 2018, at 03:54, Johnny Billquist wrote:
>
> Anyone seen this, or know what it's about?

Great, it took 6 months to trigger my assertion ...

This panic probably means the file system contains unallocated inodes
that were only partially zeroed.

Please run "fsck -f" on this file system and look for messages like
"PARTIALLY ALLOCATED INODE".

> On NetBSD/vax, with 8.99.22 from today.
>
> Removing any file that has disk blocks allocated to it:
>
> [ 653.3285523] ufs_inactive: unlinked ino 50313 on "/home" has non zero size
> 0 or blocks 1ac0 with allerror 0
> [ 653.3484633] panic: ufs_inactive: dirty filesystem?
> [ 653.3788284] cpu0: Begin traceback...
> [ 653.3984724] panic: ufs_inactive: dirty filesystem?
> [ 653.4090004] Stack traceback :
> [ 653.4231115] Process is executing in user space.
> [ 653.4286045] cpu0: End traceback...
> Stopped in pid 39.1 (rm) at netbsd:vpanic+0xc5: pushl $0
>
> If a file is small enough to have all the data in the inode itself, rm
> survives fine.

We never hold file data in inodes, only short symlinks.

> Johnny
>
> --
> Johnny Billquist || "I'm on a bus
> || on a psychedelic trip
> email: b...@softjar.se || Reading murder books
> pdp is alive! || tryin' to stay hip" - B. Idol

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: netbsd-8 hang on tstile
> On 6. Mar 2018, at 23:33, Manuel Bouyer <bou...@antioche.eu.org> wrote: > > Hello > on an up-to-date netbsd-8 Xen3 i386PAE kernel I see hangs on tstile. > Hung processes shows the same pattern, they sleep in fstrans_start(): > > This is reproductible, restarting my automatic test script hangs the same > way. This i plain ffs, no wapbl. > > Any idea ? Please enter DDB and "call fstrans_dump(0)" to see which thread blocks the transition (it will have "... shared N ..." with N > 0). -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: FFS panic
Manuel, does it help to run clri from fsdb? We definitely need an assertion of "blocks == 0" on inode deletion. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany) > On 15. Jan 2018, at 09:11, Manuel Bouyer <bou...@antioche.eu.org> wrote: > > Hello, > I get a recuring panic on a netbsd-8 host: > ffs_newvnode: ino=4 on /dsk/l1: gen 35a8ffda/35a8ffda has non zero blocks > af80 or size 0 > panic: ffs_newvnode: dirty filesystem? > > I remember something about a ffsv2 bug, but this filesystem is ffsv1. > fsck doesn't seem to fix it. > > Any idea ? > > -- > Manuel Bouyer <bou...@antioche.eu.org> > NetBSD: 26 ans d'experience feront toujours la difference > --
Re: Fixing swap1_stop
> On 19. Aug 2017, at 14:20, Christos Zoulas <chris...@zoulas.com> wrote: > > On Aug 19, 1:04pm, hann...@eis.cs.tu-bs.de ("J. Hannken-Illjes") wrote: > -- Subject: Re: Fixing swap1_stop > > | A long time ago forced unmounts tried to change open block device nodes > | to anonymous (not attached to a file system) nodes. This was racy and > | has been removed. > | > | With the recent changes to the VFS subsystem it should be possible to > | bring this behaviour back and instead of destroying open device nodes > | a forced unmount would detach them from the file system and keep them > | active. > | > | Did you mean something like this? > > Yes exactly that. Committed and pullup to -8 requested: src/sys/kern/vfs_vnode.c r1.97, r1.98 src/sys/miscfs/deadfs/dead_vfsops.c r1.8 src/sys/kern/vfs_mount.c r1.67 src/sys/sys/vnode_impl.h r1.16 -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Fixing swap1_stop
> On 18. Aug 2017, at 10:16, Robert Elz <k...@munnari.oz.au> wrote: > > After thinking about this (that is, the original problem here, > not the mount changes, which are useful for other reasons - the > reason I did the implementation I showed is that I have a very > similar need in some of my scripts, where I have just been "knowing" > that I never have weird chars, like spaces, in any of the mount point > names, up to now.) > > Anyway, after thinking about it for a bit, I am not convinced that > any fix in the rc scripts is the best way to solve the underlying > problem - and at best it means a bunch of messy config that users > would need to maintain. > > Might it not be better instead to fix it in the kernel, make it > so that if umount -f encounters a device node from the filesystem being > unmounted, it simply marks the vnode so it is known its home filesystem > has vanished (pointer to mount-point = NULL most probably -- so no more > attempts to update the times, which is all that the underlying filesys > is ever used for after a device is opened) and otherwise leave it alone? > > (A regular umount, without -f, would return EBUSY as normal.) > > That way the device keeps on working, until it is closed, when the vnode > can just be trashed, and in the meantime, the filesystem it came from can > be cleanly unmounted (or as cleanly as -f ever permits.) In the general case > we really want the umount to happen, as otherwise its parent filesys, which > might also need unmounting (a tmpfs with devices mounted on a tmpfs that > has none, but is occupying swap, for example.) > > Wouldn't this solve the original problem in a much simpler and better way? A long time ago forced unmounts tried to change open block device nodes to anonymous (not attached to a file system) nodes. This was racy and has been removed. 
With the recent changes to the VFS subsystem it should be possible to bring this behaviour back and instead of destroying open device nodes a forced unmount would detach them from the file system and keep them active. Did you mean something like this? -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: ffs_vnops.c changes break kernels w/o WAPBL
> On 1 Mar 2017, at 22:19, John D. Baker <jdba...@mylinuxisp.com> wrote: > > Recent changes to "sys/ufs/ffs/ffs_vnops.c" break building kernels which > don't include "options WAPBL" (e.g. NET4501). > > The complaint is about "struct mount *mp" being set but not used. > > In the above-mentioned file in "ffs_spec_fsync(void *v)", "struct mount *mp" > is used only in a block of code guarded with "#ifdef WAPBL". > > The following patch adds the same guard to the declaration and setting > of "struct mount *mp". This allows the NET4501 kernel to build. > > +Index: sys/ufs/ffs/ffs_vnops.c > +=== > +RCS file: /cvsroot/src/sys/ufs/ffs/ffs_vnops.c,v > +retrieving revision 1.126 > +diff -u -p -r1.126 ffs_vnops.c > +--- sys/ufs/ffs/ffs_vnops.c 1 Mar 2017 10:42:45 - 1.126 > sys/ufs/ffs/ffs_vnops.c 1 Mar 2017 20:10:33 - > +@@ -283,12 +283,16 @@ ffs_spec_fsync(void *v) > + } */ *ap = v; > + int error, flags, uflags; > + struct vnode *vp; > ++#ifdef WAPBL > + struct mount *mp; > ++#endif /* WAPBL */ > + > + flags = ap->a_flags; > + uflags = UPDATE_CLOSE | ((flags & FSYNC_WAIT) ? UPDATE_WAIT : 0); > + vp = ap->a_vp; > ++#ifdef WAPBL > + mp = vp->v_mount; > ++#endif /* WAPBL */ > + > + error = spec_fsync(v); > + if (error) > > -- > |/"\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X > |\ / jdbaker[snail]mylinuxisp[flyspeck]comOpenBSD FreeBSD > | X No HTML/proprietary data in email. BSD just sits there and works! > |/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645 > Committed with slightly modification -- thanks! -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: 7.99.50 complete hang
> On 19 Dec 2016, at 14:31, Joerg Sonnenberger <jo...@bec.de> wrote: > > On Sun, Dec 18, 2016 at 09:55:58PM +0100, J. Hannken-Illjes wrote: >> >>> On 18 Dec 2016, at 21:49, Joerg Sonnenberger <jo...@bec.de> wrote: >>> >>> On Sun, Dec 18, 2016 at 09:45:00PM +0100, Joerg Sonnenberger wrote: >>>> On Fri, Dec 16, 2016 at 01:14:10AM +0100, Thomas Klausner wrote: >>>>> When I start my chrooted bulkbuild, the system completely stops. It >>>>> prints a couple of dots (like when it farms out the first steps of the >>>>> dependency chain computation) and then nothing. When I try to open a >>>>> second shell in screen, screen locks up completely. >>>> >>>> In my case, the scan finished, but it dead locked as soon as it tries to >>>> write to binary packages. This worked fine with a kernel from the ~Dec 8 >>>> sources. >>> >>> Comparing the ident output makes me suspect the vnode changes on Dec >>> 14th. Juergen? >> >> Please revert sys/kern/vfs_vnode.c to Rev 1.62 to make sure it is the >> result of this commit. > > No hang yet with 1.62. Ok, there is a problem with vdrain_vrele(). If all tests pass I will commit a fix tomorrow. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: 7.99.50 complete hang
> On 18 Dec 2016, at 21:49, Joerg Sonnenberger <jo...@bec.de> wrote: > > On Sun, Dec 18, 2016 at 09:45:00PM +0100, Joerg Sonnenberger wrote: >> On Fri, Dec 16, 2016 at 01:14:10AM +0100, Thomas Klausner wrote: >>> When I start my chrooted bulkbuild, the system completely stops. It >>> prints a couple of dots (like when it farms out the first steps of the >>> dependency chain computation) and then nothing. When I try to open a >>> second shell in screen, screen locks up completely. >> >> In my case, the scan finished, but it dead locked as soon as it tries to >> write to binary packages. This worked fine with a kernel from the ~Dec 8 >> sources. > > Comparing the ident output makes me suspect the vnode changes on Dec > 14th. Juergen? Please revert sys/kern/vfs_vnode.c to Rev 1.62 to make sure it is the result of this commit. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: repeated failure to properly shutdown
> On 22 Jul 2016, at 10:39, Robert Elzwrote: > >Date:Thu, 21 Jul 2016 16:38:57 -0700 >From:bch >Message-ID: >
Re: repeated failure to properly shutdown
> On 21 Jul 2016, at 19:26, co...@sdf.org wrote: > > On Thu, Jul 21, 2016 at 10:15:32AM -0700, bch wrote: >> Jul 20 23:55:59 kamloops /netbsd: wapbl_discard() at >> netbsd:wapbl_discard+0x20c >> Jul 20 23:55:59 kamloops /netbsd: vclean() at netbsd:vclean+0x2ae > ... >> Jul 20 23:55:59 kamloops /netbsd: tmpfs_unmount() at >> netbsd:tmpfs_unmount+0x2f > > For some reason this condition is met: > wapbl_vphaswapbl(vp) > > but why? The contents of this "struct vnode", especially its "v_tag" and "v_mount" could help. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: nbctfmerge runs for hours on custom i386 kernels
> On 02 Mar 2016, at 00:58, John D. Baker <jdba...@mylinuxisp.com> wrote: > > On Tue, 1 Mar 2016, John D. Baker wrote: > >> I have not observed this behavior when building any of the standard >> kernels, but only most if not all of my custom kernels (which simply >> include GENERIC and exclude unecessary items with the "no foo at bar" >> mechanism). > > I should note that I build all my custom kernels with the "kernel.gdb=FOO" > mechanism to produce "FOO/netbsd.gdb", in case that has any bearing on > the situation. Should be fixed now with Revision 1.4 of external/bsd/elftoolchain/dist/libdwarf/libdwarf_elf_init.c -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: nbctfmerge runs for hours on custom i386 kernels
> On 03 Mar 2016, at 19:44, Christos Zoulas <chris...@astron.com> wrote:
>
> In article <5e50530c-0628-46e0-9457-440804e21...@eis.cs.tu-bs.de>,
> J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
>>
>>> On 02 Mar 2016, at 09:11, Martin Husemann <mar...@duskware.de> wrote:
>>>
>>> On Tue, Mar 01, 2016 at 05:42:51PM -0600, John D. Baker wrote:
>>>> I have so-far observed this only on i386 and not any of the other
>>>> architectures I build.
>>>
>>> This is probably caused by ld.elf_so bugs. There is a pullup request
>>> pending to (hopefully) fix this.
>>>
>>> A simple test (and easy workaround) is to extract usr/libexec/ld.elf_so
>>> from a -current i386 base.tgz and put that on your machine.
>>
>> For me it is nbctfconvert that creates bad ctf sections on i386 and makes
>> nbctfmerge run for an hour on debug (-g) kernels.
>>
>> Reverting this
>>
>> @@ -108,5 +122,6 @@ _dwarf_elf_relocate(Dwarf_Debug dbg, Elf
>>  		}
>>
>> -		if (sh.sh_type != SHT_RELA || sh.sh_size == 0)
>> +		if ((sh.sh_type != SHT_REL && sh.sh_type != SHT_RELA) ||
>> +		     sh.sh_size == 0)
>>  			continue;
>>
>> from the recent change to
>> external/bsd/elftoolchain/dist/libdwarf/libdwarf_elf_init.c
>> makes my builds happy again.
>>
>> Looks like the .debug_info section gets modified to always return
>> the string at offset 0.
>
> This is part of this change:
> https://svnweb.freebsd.org/base/head/contrib/elftoolchain/libdwarf/libdwarf_elf_init.c?r1=278593=278611

Yes, and it looks wrong, at least for our i386 objects that use "SHT_REL"
for ".debug_info" where amd64 for example uses "SHT_RELA".

According to the "TIS ELF Specification" page 1-23:

    As shown above, only Elf32_Rela entries contain an explicit addend.
    Entries of type Elf32_Rel store an implicit addend in the location
    to be modified. Depending on the processor architecture, one form
    or the other might be necessary or more convenient. Consequently,
    an implementation for a particular machine may use one form
    exclusively or either form depending on context.

but function libdwarf_elf_init.c::_dwarf_elf_apply_rel_reloc() ignores
this "implicit addend" and treats it as zero.

Take for example file vers.o from a "-g" i386 kernel build:

RELOCATION RECORDS FOR [.debug_info]:
OFFSET TYPE VALUE
0006 R_386_32 .debug_abbrev
000c R_386_32 .debug_str
0011 R_386_32 .debug_str

where .debug_info looks like:

Contents of section .debug_info:
b201 0400 00000401 1501 0010 01270200 007f 00020106

Here location 0x000c is not zero but this value gets overwritten with zero.
As a result all strings get relocated to zero.

Running "nbctfconvert" with "env CTFCONVERT_DEBUG_LEVEL=9" will show it in
more detail.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
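To make the REL/RELA difference described above concrete, here is a minimal
sketch in C. It is not the libdwarf code: the helper names (rd32le, wr32le,
apply_rela32, apply_rel32, apply_rel32_zero_addend) and the sample values are
assumptions made up for illustration, and it only models a 32-bit
little-endian absolute relocation such as R_386_32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian word (i386 byte order). */
static uint32_t
rd32le(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit little-endian word. */
static void
wr32le(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* SHT_RELA style: the addend is taken from the relocation entry. */
static void
apply_rela32(uint8_t *sec, size_t len, size_t off, uint32_t symval,
    int32_t addend)
{
	assert(off <= len && len - off >= 4);
	wr32le(sec + off, symval + (uint32_t)addend);
}

/* SHT_REL style: the addend is whatever is already stored at the location. */
static void
apply_rel32(uint8_t *sec, size_t len, size_t off, uint32_t symval)
{
	assert(off <= len && len - off >= 4);
	wr32le(sec + off, symval + rd32le(sec + off));
}

/* Hypothetical faulty variant: the implicit addend is treated as zero. */
static void
apply_rel32_zero_addend(uint8_t *sec, size_t len, size_t off, uint32_t symval)
{
	assert(off <= len && len - off >= 4);
	wr32le(sec + off, symval);
}
```

With a section symbol value of 0 and a string offset of 0x2a stored at the
location, apply_rel32() leaves 0x2a in place, while
apply_rel32_zero_addend() stores 0, which is the "every string at offset 0"
symptom.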
Re: nbctfmerge runs for hours on custom i386 kernels
> On 02 Mar 2016, at 09:11, Martin Husemann <mar...@duskware.de> wrote:
>
> On Tue, Mar 01, 2016 at 05:42:51PM -0600, John D. Baker wrote:
>> I have so-far observed this only on i386 and not any of the other
>> architectures I build.
>
> This is probably caused by ld.elf_so bugs. There is a pullup request
> pending to (hopefully) fix this.
>
> A simple test (and easy workaround) is to extract usr/libexec/ld.elf_so
> from a -current i386 base.tgz and put that on your machine.

For me it is nbctfconvert that creates bad ctf sections on i386 and makes
nbctfmerge run for an hour on debug (-g) kernels.

Reverting this

@@ -108,5 +122,6 @@ _dwarf_elf_relocate(Dwarf_Debug dbg, Elf
 		}

-		if (sh.sh_type != SHT_RELA || sh.sh_size == 0)
+		if ((sh.sh_type != SHT_REL && sh.sh_type != SHT_RELA) ||
+		     sh.sh_size == 0)
 			continue;

from the recent change to
external/bsd/elftoolchain/dist/libdwarf/libdwarf_elf_init.c
makes my builds happy again.

Looks like the .debug_info section gets modified to always return
the string at offset 0.

Running nbctfconvert with "CTFCONVERT_DEBUG_LEVEL=9" I get from an
amd64 object:

DEBUG: NO stabs: .stab=-1, .stabstr=0
DEBUG: DWARF version: 4
DEBUG: DWARF emitter: GNU C 4.8.5 -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -mno-avx -msoft-float -mtune=nocona -march=x86-64 -g -O2 -std=gnu99 -std=gnu99 -ffreestanding -fno-zero-initialized-in-bss -fno-omit-frame-pointer -fstack-protector -fno-strict-aliasing -fno-common --param ssp-buffer-size=1
DEBUG: CU name: vers.c
DEBUG: die 29 <0x1d>: create_one
DEBUG: die 29: creating base type
DEBUG: die 29: name "signed char" remapped to "char"

and from an i386 object:

DEBUG: NO stabs: .stab=-1, .stabstr=0
DEBUG: DWARF version: 4
DEBUG: DWARF emitter: long long int
DEBUG: CU name: long long int
DEBUG: die 29 <0x1d>: create_one
DEBUG: die 29: creating base type
DEBUG: die 29: name "long long int" remapped to "long long"

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: NFS related panics and hangs
On 05 Nov 2015, at 21:48, Rhialto <rhia...@falu.nl> wrote:

> Looking into this:
>
> the occurrences of nfs_reqq are as follows:
>
> fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the
> nfs_reqq list.
>
> Since there is no other mention of nfs_reqq_mtx in the whole syssrc
> tarball, this looks wrong. It also immediately causes the suspicion
> that the list isn't in fact protected at all.

This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
implementation from FreeBSD. It is not part of any kernel.

Our nfs lives in sys/nfs.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Killing a zombie process?
On 15 Oct 2015, at 00:21, Rhialto <rhia...@falu.nl> wrote: > On Wed 14 Oct 2015 at 09:39:40 +0200, J. Hannken-Illjes wrote: >> Looks like a deadlock, two threads in tstile. >> >> Please take a backtrace (with arguments) of these threads. > > I've got a whole lot more in tstile, and that is even just from running > pkg_comp in the chroot. I didn't try to interrupt anything yet. > > load averages: 0.00, 0.20, 0.44; up 0+02:23:43 > 22:43:52 > 78 processes: 76 sleeping, 2 on CPU > CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle > Memory: 393M Act, 60K Inact, 31M Wired, 31M Exec, 273M File, 3239M Free > Swap: 4096M Total, 4096M Free > > > vargaz:~$ ps alxtp1 > UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND > 1000 139174 0 85 0 13208 2528 waitIs ttyp1 0:00.02 -bash > 0 1759 1391 1107 85 0 13304 1576 waitIttyp1 0:00.13 /bin/sh > /usr/pkg/sbin/pkg_comp chroot > 0 865 1759 1107 85 0 13304 1140 waitIttyp1 0:00.01 /bin/sh > /pkg_comp/tmp/pkg_comp-sOjsoA.sh > 0 874 865 13547 82 0 11088 1412 pause Ittyp1 0:00.01 /bin/ksh > 0 267 874 20048 81 0 15360 1720 waitI+ ttyp1 0:00.22 /bin/sh > -e /usr/pkg/sbin/pkg_chk > 0 9782 267 20048 81 0 15360 1448 waitI+ ttyp1 0:00.00 sh -c cd > /usr/pkgsrc/devel/mercurial && /usr/bin/make u > 0 8085 9782 0 117 0 15224 3452 tstile D+ ttyp1 0:00.14 > /usr/bin/make update CLEANDEPENDS > 0 26889 8085 29745 78 0 15360 1424 waitI+ ttyp1 0:00.00 /bin/sh > -c set -e; /usr/bin/env MAKECONF=/etc/mk.conf P > 0 14050 26889 0 117 0 15224 3444 tstile D+ ttyp1 0:00.14 > /usr/bin/make _MAKE OPSYS OS_VERSION LOWER_OPSYS _PKGSR > 0 6325 14050 22699 80 0 15360 1428 waitI+ ttyp1 0:00.00 /bin/sh > -c set -e; pkgpattern=mercurial-3.5.1;\t\t\t\t > 0 13334 6325 0 117 0 15224 3452 tstile D+ ttyp1 0:00.14 > /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS HOST_OSTYPE > 0 2892 13334 29745 78 0 15364 1444 waitI+ ttyp1 0:00.00 /bin/sh > -c set -e;\t\t\t\t\t\t\t\t exec 3<&0;\t\t\t\t\t > 0 13425 2892 29745 78 0 15364 1136 waitI+ ttyp1 0:00.00 
/bin/sh > -c set -e;\t\t\t\t\t\t\t\t exec 3<&0;\t\t\t\t\t > 0 17339 13425 0 117 0 15224 3504 tstile D+ ttyp1 0:00.16 > /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG > 0 11893 17339 23601 80 0 15364 1432 waitI+ ttyp1 0:00.00 /bin/sh > -c set -e; pkgpattern=py27-mercurial\\>=3.5.1;\ > 0 21797 11893 0 117 0 15228 3512 tstile D+ ttyp1 0:00.18 > /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG > 0 1347 21797 23778 80 0 15364 1456 waitI+ ttyp1 0:00.00 /bin/sh > -c set -e;\t\t\t\t\t if test -n "" && /usr/pkg > 0 23567 1347 0 117 0 15228 4032 tstile D+ ttyp1 0:00.38 > /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG > 0 3383 23567 29360 78 0 15364 1432 waitI+ ttyp1 0:00.00 /bin/sh > -c (cd /pkg_comp/obj/pkgsrc/devel/py-mercurial/ > 0 21311 3383 28277 79 0 81652 11580 waitI+ ttyp1 0:00.14 > /usr/pkg/bin/python2.7 setup.py build > 0 24114 21311 28277 79 0 15364 1424 waitI+ ttyp1 0:00.01 /bin/sh > /pkg_comp/obj/pkgsrc/devel/py-mercurial/default > 0 3590 24114 28277 79 0 15364 1472 waitI+ ttyp1 0:00.00 /bin/sh > /usr/pkgsrc/mk/tools/msgfmt.sh > 0 7060 3590 28277 117 0 4244 188 tstile D+ ttyp1 0:00.00 /bin/cat > 0 18497 3590 28277 79 0 10880 1064 pipe_wr I+ ttyp1 0:00.00 /bin/cat > i18n/el.po > 0 23883 3590 0 117 0 6580 236 netio D+ ttyp1 0:00.00 > /usr/bin/msgfmt -v -o mercurial/locale/el/LC_MESSAGES/h > 0 27257 3590 28277 117 0 4244 188 tstile D+ ttyp1 0:00.00 /bin/cat > 0 29472 3590 28277 79 0 14244 2344 pipe_wr I+ ttyp1 0:00.01 > /usr/bin/awk -f /usr/bin/awk > > (I've re-arranged the order to get parents before children) > > Here are backtraces of the processes in tstile (and the shell that > spawned the 4 leaf children). I have kept the dump so I can examine it > further. > > Unfortunately, crash(8) didn't give me arguments, nor did ddb when I > tried that (I used the GENERIC kernel, what options do I need to get the > arguments?) 
> > Script started on Wed Oct 14 23:41:43 2015 > vargaz:~/crash$ crash -M netbsd.3.core -N netbsd.test > Crash version 7.0, image version 7.99.21. > WARNING: versions differ, you may not be able to examine this image. > System panicked: dump forced via kernel debugger > Backtrace from time of crash is available. > > > crash> bt/t 0t3590 > trace: pid 3590 lid 1 at 0xff
Re: Killing a zombie process?
On 16 Oct 2015, at 13:44, Rhialto <rhia...@falu.nl> wrote:

> On Thu 15 Oct 2015 at 20:12:44 +0200, Rhialto wrote:
>> On Thu 15 Oct 2015 at 06:57:42 +0700, Robert Elz wrote:
>>> Do you really need that mounted twice like that, and if not, can you try
>>> with one of them missing and see if the problem remains ?
>>
>> Good idea, I'll try that later!
>
> "Interesting" results: it built packages overnight (from around 22:30 to
> 12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.

With panic?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Killing a zombie process?
On 14 Oct 2015, at 00:20, Rhialto <rhia...@falu.nl> wrote: > I may have something similar; with 7.0/amd64 GENERIC kernel. > > I've been doing builds in pkg_comp with the chroot directory and /usr/pkgsrc > mounted over nfs. After some packages, some processes simply don't terminate. > > Some of my processes are now (after trying to exit pkg_comp which hangs) > > UID PID PPID CPU PRI NIVSZ RSS WCHAN STAT TTY TIME COMMAND > 0 402 10 85 0 15360 1428 waitIpts/2 0:00.00 /bin/sh > -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l > -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f > 1000 683 29070 85 0 13224 2588 waitIs pts/2 0:00.03 -bash > 0 2847 683 257 117 0 13304 1576 tstile D+ pts/2 0:00.02 /bin/sh > /usr/pkg/sbin/pkg_comp chroot > 0 14284 10 85 0 15360 1428 waitIpts/2 0:00.00 /bin/sh > -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l > -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f > 0 26291 402 708 117 0 15360 1004 tstile Dpts/2 0:00.00 /bin/sh > -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l > -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f > 0 28266 142840 116 0 15360 1004 netio Dpts/2 0:00.01 /bin/sh > -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l > -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f > > No zombies involved, though. Looks like a deadlock, two threads in tstile. Please take a backtrace (with arguments) of these threads. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany) signature.asc Description: Message signed with OpenPGP using GPGMail
Re: coretemp0: workqueue busy: updates stopped
On 24 Jun 2015, at 10:51, Paul Goyette p...@vps1.whooppee.com wrote:

> [snip]
>
> There is a rather interesting mutex-dance in sme_check_events() about
> which I need to think:
>
>	mutex_enter(wq_mutex)
>	check for empty wq
>	mutex_exit(wq_mutex)
>	mutex_enter(global_sysmon_mutex)
>	mutex_enter(wq_mutex)
>	queue up the wq entries
>	mutex_exit(wq_mutex)
>	check for low_power condition
>	mutex_exit(global_sysmon_mutex)
>
> I'm pretty sure this can be reduced a bit:
>
>	mutex_enter(global_sysmon_mutex)
>	mutex_enter(wq_mutex)
>	check for empty wq
>	queue up the wq entries
>	mutex_exit(wq_mutex)
>	check for low_power condition
>	mutex_exit(global_sysmon_mutex)

It can't, see rev. 1.114:

    Add a counter of busy events and stop enqueueing more work if a
    device is busy. Protect this counter with a new short time lock
    sme_work_mtx and keep sme_mtx as long time lock.

    Removes a deadlock where an active event holds sme_mtx, the callout
    sme_events_check blocks on sme_mtx and callout processing stops.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: coretemp0: workqueue busy: updates stopped
On 23 Jun 2015, at 12:01, John D. Baker jdba...@mylinuxisp.com wrote: Last night upon updating my file server from 7.0_BETA to 7.0_RC1 (amd64), the message in the Subject: line appeared during the shutdown/reboot sequence and the machine was stuck there. Does it happen on every reboot or did you see it once? Backtrace (bt) and status (ps /l) from ddb would help. I dropped to DDB and issued the reboot command. While the filesystem on the raidframe RAID (RAID-R) was unmounted, the RAID itself had not yet been detached/un-configured. The forcible shutdown required a parity rebuild upon reboot. Has anyone experienced a similar hang on shutdown/reboot? I didn't bother recording the backtrace in DDB as I just wanted my file server back up and running... -- |/\ John D. Baker, KN5UKS NetBSD Darwin/MacOS X |\ / jdbaker[snail]mylinuxisp[flyspeck]comOpenBSDFreeBSD | X No HTML/proprietary data in email. BSD just sits there and works! |/ \ GPGkeyID: D703 4A7E 479F 63F8 D3F4 BD99 9572 8F23 E4AD 1645 -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 12 Mar 2015, at 20:59, Christos Zoulas chris...@zoulas.com wrote:

> | | Now we have a deadlock, softclk/0 waits for the mutex and therefore
> | | callouts will no longer be processed and ciss holds the mutex and waits
> | | for a callout through cv_timedwait.
> |
> | Thanks for looking into it! Part of the ciss_ioctl_vol() (the pdid part)
> | does things with XS_CTL_POLL so that it does not involve any mutexes. It
> | would be simple to change the ldid part to do the same. Should we do that?

The mutex involved is the sme_mtx protecting the struct sysmon_envsys, so
our problem doesn't come from missing POLL.

> | | - Sleeping up to 60 seconds in a function used by a callout is wrong.
> |
> | Yes, but many disk drivers seem to violate that. How do we fix this?
> | Making a separate thread that updates statistics for each driver seems
> | suboptimal?

We already have it. If I understand sysmon right, it is already based on
workqueues (the ciss0 thread here):

The workqueue updates sensors every sme->sme_events_timeout seconds, default
is 30 seconds. Workqueue items get enqueued from a callout.

Both running a workqueue item and processing the callout lock the
same mutex sme->sme_mtx.

For this to work the workqueue must complete before the callout fires:

	sme->sme_nsensors * ccb->ccb_xs->timeout < sme->sme_events_timeout

In our ciss case we could set:

	sc->sc_sme->sme_events_timeout = 30
	ccb->ccb_xs->timeout = 20 / sc->maxunits

to become safe.

Hope I got this right so far ...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
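The timing condition in the message above can be written down as a small
stand-alone check. This is only a sketch: the function names refresh_fits()
and xs_timeout_for() are hypothetical, all values are in seconds, and it
assumes the worst case where every sensor refresh runs into its command
timeout.

```c
#include <stdbool.h>

/*
 * Worst case: each of the nsensors refreshes takes the full per-command
 * timeout. Returns true if all of them still finish before the next
 * callout fires after events_timeout seconds.
 */
static bool
refresh_fits(unsigned nsensors, unsigned xs_timeout, unsigned events_timeout)
{
	return (unsigned long long)nsensors * xs_timeout < events_timeout;
}

/*
 * Per-command timeout derived from a total budget, mirroring the
 * "20 / sc->maxunits" suggestion. Integer division truncates.
 */
static unsigned
xs_timeout_for(unsigned budget, unsigned maxunits)
{
	return maxunits == 0 ? budget : budget / maxunits;
}
```

For example, four units with a budget of 20 give a per-command timeout of 5,
and 4 * 5 = 20 stays below an events timeout of 30. Because the division
truncates, more than 20 units would yield a timeout of 0, so a real driver
would presumably need a lower bound.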
Re: DoS attack against TCP services
On 13 Mar 2015, at 13:03, Christos Zoulas chris...@zoulas.com wrote:

> On Mar 13, 1:00pm, hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) wrote:
> -- Subject: Re: DoS attack against TCP services
>
> | This would be simple, changing dev/ic/ciss.c like:
> |
> |   sc->sc_sme->sme_name = device_xname(sc->sc_dev);
> |   sc->sc_sme->sme_cookie = sc;
> |   sc->sc_sme->sme_refresh = ciss_sensor_refresh;
> | + sc->sc_sme->sme_events_timeout = 60;
> |
> | should do the job. Unfortunately I have no hardware to test.
>
> Yes, but is 60 enough? Leaving the calculation to each driver is
> potentially dangerous. Could we make it self adjusting?

This was just an idea ...

Maybe ...xs..timeout * sc->maxunits + 10 and set xs timeout to 1 .. 5
seconds?

I don't think it is possible to make it self adjusting as the sysmon
framework doesn't know the drivers' timeouts.

> | Nevertheless, I think that the big problem with ciss is now
> | fixed (i.e. it will not hang forever anymore)...
> |
> | It may still wait longer than 30 seconds with the sme_mutex held
> | leading to deadlock.
> |
> | We should use a suitable xs timeout vs. events timeout to make it safe,
> | either increase the event timeout or decrease the xs timeout.
>
> It would be nice if it was safe by default, and it should spam the kernel
> if it was late so that we know about it...

Unfortunately it may deadlock BEFORE it finds a non-empty workqueue.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 13 Mar 2015, at 12:53, Christos Zoulas chris...@zoulas.com wrote:

> On Mar 13, 11:19am, hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) wrote:
> -- Subject: Re: DoS attack against TCP services
>
> | The mutex involved is the sme_mtx protecting the struct sysmon_envsys, so
> | our problem doesn't come from missing POLL.
>
> That's what I thought.
>
> | We already have it. If I understand sysmon right, it is already based on
> | workqueues (the ciss0 thread here):
> |
> | The workqueue updates sensors every sme->sme_events_timeout seconds,
> | default is 30 seconds. Workqueue items get enqueued from a callout.
> |
> | Both running a workqueue item and processing the callout locks the
> | same mutex sme->sme_mtx.
> |
> | For this to work the workqueue must complete before the callout fires:
> |
> |   sme->sme_nsensors * ccb->ccb_xs->timeout < sme->sme_events_timeout
> |
> | In our ciss case we could set:
> |
> |   sc->sc_sme->sme_events_timeout = 30
> |   ccb->ccb_xs->timeout = 20 / sc->maxunits
> |
> | to become safe.
> |
> | Hope I got this right so far ...
>
> Yes, you do. We could decrease the timeout for probing, but that might
> lead to unsuccessful sensor reads. Even then perhaps there is a place
> to have a special mode for sysmon to use a separate thread for reading
> the sensors of a particular driver, or a way to change the sysmon period
> to be longer.

This would be simple, changing dev/ic/ciss.c like:

	  sc->sc_sme->sme_name = device_xname(sc->sc_dev);
	  sc->sc_sme->sme_cookie = sc;
	  sc->sc_sme->sme_refresh = ciss_sensor_refresh;
	+ sc->sc_sme->sme_events_timeout = 60;

should do the job. Unfortunately I have no hardware to test.

> Nevertheless, I think that the big problem with ciss is now
> fixed (i.e. it will not hang forever anymore)...

It may still wait longer than 30 seconds with the sme_mutex held
leading to deadlock.

We should use a suitable xs timeout vs. events timeout to make it safe,
either increase the event timeout or decrease the xs timeout.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 28 Feb 2015, at 21:05, Christos Zoulas chris...@zoulas.com wrote: On Feb 28, 8:26pm, 6b...@6bone.informatik.uni-leipzig.de (6b...@6bone.informatik.uni-leipzig.de) wrote: -- Subject: Re: DoS attack against TCP services | On Sat, 28 Feb 2015, Christos Zoulas wrote: | | Yes, that's a good start but we need to find which process that | lwp belongs to. | | I'm not sure what the best course of action is. The machine is still | running. Should you try to get the information from the current system or | force a dump and analyze this? | | On Sat, 28 Feb 2015, J. Hannken-Illjes wrote: | | Looks unlocked -- what about a backtrace of thread 0.5, | bt /a 0xfe882df11860 | | https://www.ipv6.uni-leipzig.de/bt_0xfe882df11860.png So who else is holding the sysmon sme_mtx? Analyzed a crash dump and found two threads deadlocked. 0 77 3 0 200 fe813b495b60 ciss0 ciss_cmd 05 3 0 200 fe882df11860 softclk/0 tstile Backtrace of softclk/0: ... 3 mutex_vector_enter sys/kern/kern_mutex.c:682 4 sme_events_check sys/dev/sysmon/sysmon_envsys_events.c:734 5 callout_softclock sys/kern/kern_timeout.c:743 6 softint_executesys/kern/kern_softint.c:589 ... Here the event struct sme is: sme_name = ciss0 sme_mtx.u.mtxa_owner = 0xfe813b495b62 (Thread ciss0) Backtrace of ciss0: ... 2 cv_timedwait sys/kern/kern_condvar.c:261 3 ciss_cmd sys/dev/ic/ciss.c:542 4 ciss_ldidsys/dev/ic/ciss.c:883 5 ciss_ioctl_vol sys/dev/ic/ciss.c:1388 6 ciss_sensor_refresh sys/dev/ic/ciss.c:1544 7 sysmon_envsys_refresh_sensor sys/dev/sysmon/sysmon_envsys.c:2027 8 sme_events_workersys/dev/sysmon/sysmon_envsys_events.c:769 9 workqueue_runlistsys/kern/subr_workqueue.c:104 10 workqueue_worker sys/kern/subr_workqueue.c:135 ... The sme mutex was locked from sme_events_worker at sysmon_envsys_events.c:760. Now we have a deadlock, softlck/0 waits for the mutex and therefore callouts will no longer be processed and ciss holds the mutex and waits for a callout through cv_timedwait. 
Taking a closer look at the poll loop from sys/dev/ic/ciss.c:537 ... this code looks wrong in many aspects:

- Sleeping up to 60 seconds in a function used by a callout is wrong.
- Examining variables here we get: tick = 10000, etick = 16000, tohz = 6000 and i = 599. As tick is constant (us per hz) this loop might run for 599*60 seconds!

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 12 Mar 2015, at 20:00, Christos Zoulas chris...@zoulas.com wrote:

On Mar 12, 12:20pm, hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) wrote:
-- Subject: Re: DoS attack against TCP services

| Now we have a deadlock, softclk/0 waits for the mutex and therefore
| callouts will no longer be processed and ciss holds the mutex and waits
| for a callout through cv_timedwait.

Thanks for looking into it! Part of the ciss_ioctl_vol() (the pdid part) does things with XS_CTL_POLL so that it does not involve any mutexes. It would be simple to change the ldid part to do the same. Should we do that?

| Taking a closer look at the poll loop from sys/dev/ic/ciss.c:537 ... this
| code looks wrong in many aspects:
|
| - Sleeping up to 60 seconds in a function used by a callout is wrong.

Yes, but many disk drivers seem to violate that. How do we fix this? Making a separate thread that updates statistics for each driver seems suboptimal?

| - Examining variables here we get: tick = 10000, etick = 16000,
| tohz = 6000 and i = 599. As tick is constant (us per hz)
| this loop might run for 599*60 seconds!

I committed a fix for this. Now it should only sleep up to 60 seconds.

Looks like you made it worse. tick is constant, for HZ == 100 it is 10000, so you now have

etick = tick + tohz -> etick = 10000 + tohz

and then

tohz = etick - tick -> tohz = (10000 + tohz) - 10000 -> tohz = tohz

so ciss_wait() may now loop forever. Are you looking for hardclock_ticks?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
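The arithmetic in the message above can be replayed outside the kernel. The following is a sketch, not the real ciss_wait(): both function names are hypothetical, "tick" stands for the kernel constant (microseconds per clock tick), and the second function uses plain integers in place of a counter such as hardclock_ticks.

```c
/*
 * Sketch only: models the arithmetic, not the real ciss_wait().
 * "tick" is a constant, so a deadline built from it never gets closer.
 */
static int
remaining_from_constant(int tick, int tohz)
{
	int etick = tick + tohz;	/* "deadline" built from a constant */

	return etick - tick;		/* always yields tohz again */
}

/*
 * Same arithmetic against a counter that advances once per clock tick,
 * which is the role hardclock_ticks plays in the kernel.
 */
static int
remaining_from_counter(int start, int now, int tohz)
{
	int etick = start + tohz;	/* deadline relative to the start */

	return etick - now;		/* shrinks as "now" advances */
}
```

Whatever value tick has, the first function hands back the full timeout on every iteration, which is why the loop never runs out of time; the second one reaches zero once the counter has advanced by tohz.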
Re: rump i386 cross compile trouble
On 07 Mar 2015, at 18:03, Patrick Welche pr...@cam.ac.uk wrote:

On Tue, Mar 03, 2015 at 02:56:16PM +0000, Patrick Welche wrote:

On Tue, Mar 03, 2015 at 12:10:15PM +0000, Patrick Welche wrote:

Not having much luck.. with today's source I see:

No DBG / optimisation anywhere. Additions to /etc/mk.conf:

RUMP_DIAGNOSTIC=yes
RUMP_DEBUG=yes
RUMP_LOCKDEBUG=yes
RUMP_KTRACE=yes

Removing these gets a successful build - that narrows it down a bit...

Bisection just yielded a surprise: the build with RUMP_DEBUG=yes was broken by sys/kern/kern_module.c

revision 1.103
date: 2015-02-28 23:04:34 +0000; author: jmcneill; state: Exp; lines: +4 -3; commitid: X5g1KIdu4fu6uPby;
if the root file-system is not yet mounted, hide vfs load failed spam with options DEBUG

- if (modclass != MODULE_CLASS_EXEC || error != ENOENT)
+ if ((modclass != MODULE_CLASS_EXEC || error != ENOENT) &&
+ root_device != NULL)

but why? Compiler bug (gcc)?

Given this fragment is #ifdef DEBUG it looks like rump_server has to be linked with librumpvfs in the DEBUG case? Antti?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 28 Feb 2015, at 16:28, Christos Zoulas chris...@zoulas.com wrote:

On Feb 28, 11:37am, 6b...@6bone.informatik.uni-leipzig.de (6b...@6bone.informatik.uni-leipzig.de) wrote:
-- Subject: Re: DoS attack against TCP services

| On Fri, 13 Feb 2015, Christos Zoulas wrote:
|
| I tried adding show callout to crash(8) but it is not useful because the
| pointers move too quickly. OTOH, next time this happens you can enter ddb
| on your machine and type show callout and see if that sheds any light
| to the expired and not fired callouts...
|
| christos
|
| The problem occurred again. I have created a couple of screenshots.
| Unfortunately I can not interpret the output.
|
| https://www.ipv6.uni-leipzig.de/callout_1.png
| https://www.ipv6.uni-leipzig.de/callout_2.png
| https://www.ipv6.uni-leipzig.de/callout_3.png
| https://www.ipv6.uni-leipzig.de/callout_4.png
| https://www.ipv6.uni-leipzig.de/callout_x.png
|
| Thank you for your efforts

So all the timeouts have expired and are not firing anymore (negative times). This would indicate something broken with interrupts... Let me see where we can add some debugging...

Anyone holding proc_lock? I had a similar problem with fstrans where a deadlock with proc_lock prevented timer_intr() from succeeding, and therefore all timers stopped working.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 28 Feb 2015, at 18:20, 6b...@6bone.informatik.uni-leipzig.de wrote:

On Sat, 28 Feb 2015, Christos Zoulas wrote:

Good idea. You can use crash, ps and see what each process is holding...

christos

Here is the output from crash and ps:

gate# crash
Crash version 7.0_BETA, image version 7.99.5.
WARNING: versions differ, you may not be able to examine this image.
Output from a running system is unreliable.
crash> ps
PID LID S CPU FLAGS STRUCT LWP * NAME WAIT
snip
0 5 3 0 200 fe882df11860 softclk/0 tstile

This one looks bad. Which thread holds proc_lock?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: DoS attack against TCP services
On 28 Feb 2015, at 19:39, 6b...@6bone.informatik.uni-leipzig.de wrote:

On Sat, 28 Feb 2015, J. Hannken-Illjes wrote:

This one looks bad. Which thread holds proc_lock?

Does this help? https://www.ipv6.uni-leipzig.de/proc_lock.png

Looks unlocked -- what about a backtrace of thread 0.5,

bt /a 0xfe882df11860

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: FUSE crashes on i386 and amd64
On 31 Oct 2014, at 17:09, Tom Ivar Helbekkmo t...@hamartun.priv.no wrote: I'm experimenting with MooseFS on NetBSD/i386 and /amd64, current as of September 9th. Please update -- hopefully fixed with Rev. 1.34 of fs/puffs/puffs_node.c -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 25 Aug 2014, at 17:39, Taylor R Campbell riastr...@netbsd.org wrote: Date: Mon, 25 Aug 2014 15:55:53 +0200 From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16. Could this be the cause of this big loss in performance? Shouldn't be too hard to test this. Perhaps try dropping in the following replacements for the vcache key comparison and running the test for each one? memequal.c We are talking about a kernel from 2012/09 -- vcache came much later. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 25 Aug 2014, at 15:55, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

On 24 Aug 2014, at 18:57, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

snip

I tried to bisect and got an increase in time from ~15 secs to ~24 secs between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'.

Someone should redo this test as this interval is the import of the compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I can't believe this to be a compiler problem.

GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.

Could this be the cause of this big loss in performance?

Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c from Rev. 1.3 to 1.2 brings back the old times which are the same as they were on NetBSD 6.

Given that this test has many calls to ufs_lookup/cache_lookup using memcmp to check for equal filenames this is not a surprise.

A rather naive implementation of memcmp (see below) drops the running time from ~15 secs to ~9 secs. We should consider improving our memcmp.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: libkern.h
===
RCS file: /cvsroot/src/sys/lib/libkern/libkern.h,v
retrieving revision 1.106
diff -p -u -2 -r1.106 libkern.h
--- libkern.h 30 Aug 2012 12:16:49 -0000 1.106
+++ libkern.h 25 Aug 2014 17:23:35 -0000
@@ -262,5 +262,18 @@ void *memset(void *, int, size_t);
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
 #define memcpy(d, s, l) __builtin_memcpy(d, s, l)
-#define memcmp(a, b, l) __builtin_memcmp(a, b, l)
+static inline int __memcmp(const void *a, const void *b, size_t l)
+{
+	const unsigned char *pa = a, *pb = b;
+
+	if (l > 8)
+		return memcmp(a, b, l);
+	while (l-- > 0) {
+		if (__predict_false(*pa != *pb))
+			return *pa < *pb ? -1 : 1;
+		pa++; pb++;
+	}
+	return 0;
+}
+#define memcmp(a, b, l) __memcmp(a, b, l)
 #endif
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
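For experimenting with the idea outside the kernel, the inline comparison from the patch above can be lifted into a userland sketch. Assumptions: the name small_memcmp is made up for this example, the kernel-only __predict_false() hint is left out, and the cutoff of 8 bytes is simply taken over from the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userland sketch of the patch's approach: compare short buffers with
 * an inline byte loop and hand longer ones to the library memcmp().
 * small_memcmp is a hypothetical name, not a libkern function.
 */
static inline int
small_memcmp(const void *a, const void *b, size_t l)
{
	const unsigned char *pa = a, *pb = b;

	if (l > 8)
		return memcmp(a, b, l);	/* long input: library routine */
	while (l-- > 0) {
		if (*pa != *pb)
			return *pa < *pb ? -1 : 1;
		pa++;
		pb++;
	}
	return 0;			/* all l bytes were equal */
}
```

Only the sign of the result is meaningful for inputs longer than 8 bytes, since the library memcmp() is free to return any negative or positive value.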
Re: Testing 7.0 Beta: FFS still very slow when creating files
On 22 Aug 2014, at 18:29, Taylor R Campbell riastr...@netbsd.org wrote:

Date: Fri, 22 Aug 2014 17:59:37 +0200
From: Stephan stephan...@googlemail.com

Has anybody an idea on this or how to track this down? At the moment, I can't even enter ddb using Ctrl+Alt+Esc keys for some reason. I've also seen people playing with dtrace but that doesn't seem to be included.

Dtrace may be a good idea. You can use it by

(a) using a kernel built with `options KDTRACE_HOOKS',
(b) using a userland built with MKDTRACE=yes,
(c) modload /stand/ARCH/VERSION/solaris.kmod
    modload /stand/ARCH/VERSION/dtrace.kmod
    modload /stand/ARCH/VERSION/fbt.kmod
    modload /stand/ARCH/VERSION/sdt.kmod
(d) mkdir /dev/dtrace
    mknod /dev/dtrace/dtrace c dtrace

(Yes, this is too much work. Someone^TM should turn it all on by default for netbsd-7...!)

From the lockstat output it looks like there's a lot of use of mntvnode_lock, which suggests this may be related to hannken@'s vnode cache changes. Might be worthwhile to sample stack traces of vfs_insmntque, with something like

dtrace -n 'fbt::vfs_insmntque:entry { @[stack()] = count(); }'

or perhaps sample stack traces of the mutex_enters of mntvnode_lock.

This was my first guess too ...

I tried to bisect and got an increase in time from ~15 secs to ~24 secs between the time stamps '2012-09-18 06:00 UTC' and '2012-09-18 09:00 UTC'.

Someone should redo this test as this interval is the import of the compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I can't believe this to be a compiler problem.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Btw.: my test script is:

#! /bin/sh
mdconfig /dev/md0d 2048000 &
P=${!}
newfs /dev/rmd0a
mount -t ffs -o log /dev/md0a /mnt
(cd /mnt && time sh -c 'seq 1 3|xargs touch')
umount /mnt
kill ${P}
RiscOS FILECORE disk image needed
Subject says it all: I'm looking for a RiscOS FILECORE disk image that is mountable and readable on NetBSD with

vnconfig -rc vnd0 image
mount -r -t filecore /dev/vnd0d /mnt
ls -la /mnt
...

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: tmpfs lock error panic with mknod+S_IFMT
On 26 May 2014, at 16:25, Nicolas Joly nj...@pasteur.fr wrote:

Hi, While testing some linux binary, I encountered a reproducible lock error when the program issued the following mknod call on a tmpfs mount:

mknod(dummy, S_IFMT|0666, 0);

You try to create a bad sector on tmpfs -- see do_sys_mknodat() for details. More analysis when cvs.netbsd.org is back again.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: i386 ddb trace stopped working with gcc48
On 21 Apr 2014, at 19:34, Christos Zoulas chris...@astron.com wrote:

In article 910f0bed-9fd7-4c06-b886-e91002d03...@eis.cs.tu-bs.de, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

On 21 Apr 2014, at 14:39, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

Since i386 switched to gcc48 ddb trace no longer works:

fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c02802f4 cs 8 eflags 202 cr2 bbbab0c4 ilevel 8 esp 800
curlwp 0xc5a9fd20 pid 0 lid 2 lowest kstack 0xdd3b22c0
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0} bt
breakpoint(c0e661c0,3f8,0,0,c61c5158,c170dacc,c6188000,c5f396c0,c5f39748,dd3b4edc) at netbsd:breakpoint+0x4

That's all, never get more than one line. The i386_frame from %ebp = dd25ef30 looks like:

dd25ef30: 7ff = should be the previous frame
dd25ef34: c0277cc1 = comintr+0x53e (caller of breakpoint)
dd25ef38: c0e661c0

Ideas anyone?

Some further notes:

- The function prologue has changed as

-push %ebp
-mov %esp,%ebp
 sub $0x14,%esp
 call ...
-leave
+add $0x14,%esp
 ret

- With -fno-omit-frame-pointer all is well.

Perhaps the default changed to -fomit-frame-pointer... We should consider changing it back like we did for amd64.

Do we really want to add makeoptions COPTS=... -fno-omit-frame-pointer to all i386 kernel configs like amd64 does, or should it better go to sys/arch/i386/conf/Makefile.i386?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
i386 ddb trace stopped working with gcc48
Since i386 switched to gcc48 ddb trace no longer works:

fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c02802f4 cs 8 eflags 202 cr2 bbbab0c4 ilevel 8 esp 800
curlwp 0xc5a9fd20 pid 0 lid 2 lowest kstack 0xdd3b22c0
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0} bt
breakpoint(c0e661c0,3f8,0,0,c61c5158,c170dacc,c6188000,c5f396c0,c5f39748,dd3b4edc) at netbsd:breakpoint+0x4

That's all, never get more than one line. The i386_frame from %ebp = dd25ef30 looks like:

dd25ef30: 7ff = should be the previous frame
dd25ef34: c0277cc1 = comintr+0x53e (caller of breakpoint)
dd25ef38: c0e661c0

Ideas anyone?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: i386 ddb trace stopped working with gcc48
On 21 Apr 2014, at 14:39, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

Since i386 switched to gcc48 ddb trace no longer works:

fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c02802f4 cs 8 eflags 202 cr2 bbbab0c4 ilevel 8 esp 800
curlwp 0xc5a9fd20 pid 0 lid 2 lowest kstack 0xdd3b22c0
Stopped in pid 0.2 (system) at netbsd:breakpoint+0x4: popl %ebp
db{0} bt
breakpoint(c0e661c0,3f8,0,0,c61c5158,c170dacc,c6188000,c5f396c0,c5f39748,dd3b4edc) at netbsd:breakpoint+0x4

That's all, never get more than one line. The i386_frame from %ebp = dd25ef30 looks like:

dd25ef30: 7ff = should be the previous frame
dd25ef34: c0277cc1 = comintr+0x53e (caller of breakpoint)
dd25ef38: c0e661c0

Ideas anyone?

Some further notes:

- The function prologue has changed as

-push %ebp
-mov %esp,%ebp
 sub $0x14,%esp
 call ...
-leave
+add $0x14,%esp
 ret

- With -fno-omit-frame-pointer all is well.

Does it ring any bell?

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
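The symptom in the message above (a one-line backtrace) is what a frame-pointer based unwinder produces when functions stop saving and linking %ebp. The sketch below only models that idea with a hypothetical struct and function; it is not ddb's i386 trace code.

```c
#include <stddef.h>

/*
 * Hypothetical model of a stack frame as seen by a frame-pointer
 * unwinder: the saved frame pointer of the caller, then the return
 * address.
 */
struct frame {
	const struct frame *prev;	/* saved %ebp of the caller */
	const void *ret;		/* return address into the caller */
};

/*
 * Follow the chain of saved frame pointers and count the frames seen.
 * The walk ends at a null link or after "max" frames.
 */
static size_t
walk_frames(const struct frame *fp, size_t max)
{
	size_t n = 0;

	while (fp != NULL && n < max) {
		n++;
		fp = fp->prev;
	}
	return n;
}
```

When every function keeps the chain intact the walk visits each caller in turn. If a function skips "push %ebp; mov %esp,%ebp", the slot the unwinder reads no longer holds a valid link, modelled here as a null prev, and the trace stops after a single frame.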
Re: Cannot execute elf binary ...
On Dec 21, 2013, at 6:21 PM, Kurt Schreiner k...@ub.uni-mainz.de wrote:

a kernel compiled from source cvs updated some minutes ago can't exec elf binaries anymore; seen on i386 and arm, screenshot of i386-VM attached.

Just revert src/sys/kern/exec_elf.c to 1.51. Running strnlen() on a freshly allocated memory region looks strange.

-- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)