Re: build.sh kernel does not finish with endless nbctfmerge run

2024-04-01 Thread J. Hannken-Illjes
Oops, forgot to "cvs update" -- all builds are working for me.

Sorry for the noise ...

--
J. Hannken-Illjes - hann...@mailbox.org



> On 31. Mar 2024, at 12:14, J. Hannken-Illjes  wrote:
> 
>> On 31. Mar 2024, at 11:23, Ryo ONODERA  wrote:
>> 
>> chris...@astron.com (Christos Zoulas) writes:
>> 
>>> In article <4b5a66e1-7a3e-48ce-9ace-f9249e75f...@mailbox.org>,
>>> J. Hannken-Illjes  wrote:
>>>> I also added an abort() when _dwarf_get_reloc_size() returns on
>>>> "/* unknown relocation. */" and this killed nbctfconvert() as
>>>> 
>>>> _dwarf_get_reloc_size ()
>>>> _dwarf_elf_init ()
>>>> dwarf_elf_init ()
>>>> dw_read ()
>>>> main ()
>>>> 
>>>> For me nbctfmerge on kernels succeeds after up to 30 minutes, but the
>>>> resulting CTF sections are far too big and look very strange:
>>> 
>>> Yes, I also reproduced it. Back to the drawing board...
>>> 
>>> christos
>>> 
>> 
>> Anyway I can finish build.sh kernel=GENERIC now.
>> Thanks for your investigations.
> 
> Unfortunately broken again (for read only source at least) ...
> 
> After Taylor's commits last night:
> 
> cvs rdiff -u -r1.217 -r1.218 src/tools/Makefile
> cvs rdiff -u -r1.2 -r0 src/tools/elftoolchain/Makefile
> cvs rdiff -u -r1.5 -r1.6 src/tools/elftoolchain/libdwarf/Makefile
> 
> my clean build of amd64, i386 and sparc64 succeeded without any
> problem.  With your commit this morning:
> 
> cvs rdiff -u -r1.6 -r1.7 \
>   src/external/bsd/elftoolchain/dist/libdwarf/libdwarf_reloc.c
> cvs rdiff -u -r1.8 -r1.9 src/tools/Makefile.nbincludes
> cvs rdiff -u -r1.3 -r1.4 src/tools/elftoolchain/common/sys/Makefile
> cvs rdiff -u -r0 -r1.1 src/tools/elftoolchain/common/sys/elfdefinitions.h
> 
> amd64 and i386 still build, but sparc64 fails with:
> 
> sh: cannot create elfdefinitions.h: read-only file system
> --- elfdefinitions.h ---
> *** Failed target: elfdefinitions.h
> *** Failed commands:
>     ${_MKTARGET_CREATE}
>     => @# " create " sys/elfdefinitions.h
>     ${TOOL_M4} -I${SRCDIR} -D SRCDIR=${SRCDIR} ${M4FLAGS} elfdefinitions.m4 > ${.TARGET}
>     => /work/build/obj/tools.sparc64/bin/nbm4 -I/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys -D SRCDIR=/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys elfdefinitions.m4 > elfdefinitions.h
> *** [elfdefinitions.h] Error code 1
> 
> --
> J. Hannken-Illjes - hann...@mailbox.org



Re: build.sh kernel does not finish with endless nbctfmerge run

2024-03-31 Thread J. Hannken-Illjes
> On 31. Mar 2024, at 11:23, Ryo ONODERA  wrote:
> 
> chris...@astron.com (Christos Zoulas) writes:
> 
>> In article <4b5a66e1-7a3e-48ce-9ace-f9249e75f...@mailbox.org>,
>> J. Hannken-Illjes  wrote:
>>> I also added an abort() when _dwarf_get_reloc_size() returns on
>>> "/* unknown relocation. */" and this killed nbctfconvert() as
>>> 
>>> _dwarf_get_reloc_size ()
>>> _dwarf_elf_init ()
>>> dwarf_elf_init ()
>>> dw_read ()
>>> main ()
>>> 
>>> For me nbctfmerge on kernels succeeds after up to 30 minutes, but the
>>> resulting CTF sections are far too big and look very strange:
>> 
>> Yes, I also reproduced it. Back to the drawing board...
>> 
>> christos
>> 
> 
> Anyway I can finish build.sh kernel=GENERIC now.
> Thanks for your investigations.

Unfortunately broken again (for read only source at least) ...

After Taylor's commits last night:

cvs rdiff -u -r1.217 -r1.218 src/tools/Makefile
cvs rdiff -u -r1.2 -r0 src/tools/elftoolchain/Makefile
cvs rdiff -u -r1.5 -r1.6 src/tools/elftoolchain/libdwarf/Makefile

my clean build of amd64, i386 and sparc64 succeeded without any
problem.  With your commit this morning:

cvs rdiff -u -r1.6 -r1.7 \
   src/external/bsd/elftoolchain/dist/libdwarf/libdwarf_reloc.c
cvs rdiff -u -r1.8 -r1.9 src/tools/Makefile.nbincludes
cvs rdiff -u -r1.3 -r1.4 src/tools/elftoolchain/common/sys/Makefile
cvs rdiff -u -r0 -r1.1 src/tools/elftoolchain/common/sys/elfdefinitions.h

amd64 and i386 still build, but sparc64 fails with:

sh: cannot create elfdefinitions.h: read-only file system
--- elfdefinitions.h ---
*** Failed target: elfdefinitions.h
*** Failed commands:
    ${_MKTARGET_CREATE}
    => @# " create " sys/elfdefinitions.h
    ${TOOL_M4} -I${SRCDIR} -D SRCDIR=${SRCDIR} ${M4FLAGS} elfdefinitions.m4 > ${.TARGET}
    => /work/build/obj/tools.sparc64/bin/nbm4 -I/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys -D SRCDIR=/work/build/src/tools/elftoolchain/common/sys/../../../../external/bsd/elftoolchain/dist/common/sys elfdefinitions.m4 > elfdefinitions.h
*** [elfdefinitions.h] Error code 1

--
J. Hannken-Illjes - hann...@mailbox.org


Re: build.sh kernel does not finish with endless nbctfmerge run

2024-03-30 Thread J. Hannken-Illjes
I also added an abort() when _dwarf_get_reloc_size() returns on
"/* unknown relocation. */" and this killed nbctfconvert() as

_dwarf_get_reloc_size ()
_dwarf_elf_init ()
dwarf_elf_init ()
dw_read ()
main ()

For me nbctfmerge on kernels succeeds after up to 30 minutes, but the
resulting CTF sections are far too big and look very strange:

Before this change I had

  total number of types       = 30015
  total number of integers    = 65
  total number of floats      = 1
  total number of pointers    = 7902
  total number of arrays      = 3515
  total number of func types  = 2252

and now I have

  total number of types       = 322862
  total number of integers    = 3865
  total number of floats      = 1
  total number of pointers    = 127495
  total number of arrays      = 14350
  total number of func types  = 65411

and running the merge with CTFMERGE_DEBUG_LEVEL=2 I get

ERROR: nbctfmerge: Second pass for 5978 ((anon)) == 13434
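
As a rough way to quantify the jump between the two sets of counts above, the sketch below prints each "now" count as an integer percentage of the "before" count, using plain shell arithmetic. The helper name growth_pct is made up for illustration; the numbers are the ones listed above.

```shell
# growth_pct BEFORE AFTER: print AFTER as an integer percentage of BEFORE.
# Hypothetical helper, not part of any NetBSD tool.
growth_pct() {
    echo $(( $2 * 100 / $1 ))
}

for pair in types:30015:322862 integers:65:3865 pointers:7902:127495 \
            arrays:3515:14350 functypes:2252:65411; do
    name=${pair%%:*}          # text before the first colon
    rest=${pair#*:}           # remaining BEFORE:AFTER
    printf '%-10s %s%%\n' "$name" "$(growth_pct "${rest%%:*}" "${rest#*:}")"
done
```

For the type count this gives 1075%, i.e. a bit more than a tenfold increase, while the float count is unchanged.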

--
J. Hannken-Illjes - hann...@mailbox.org



> On 30. Mar 2024, at 15:23, Christos Zoulas  wrote:
> 
> I don't think that's the problem. I added abort() calls just before the
> return 0 and they never fire for me (and the kernel built has the right
> CTF information). Nevertheless I think that the relocation code is not
> used in the CTF code; it just parses the DWARF debug info and builds CTF
> stabs from it. I think that the threading code in ctf is problematic
> because we had this problem in the past.
> 
> christos
> 
>> On Mar 29, 2024, at 9:43 PM, Ryo ONODERA  wrote:
>> 
>> Hi,
>> 
>> The following two commits cause endless nbctfmerge run
>> at end of build.sh kernel=GENERIC for me.
>> My environment is NetBSD/amd64 10.99.10 of yesterday.
>> 
>> Could you investigate my problem?
>> 
>> Module Name: src
>> Committed By: christos
>> Date: Wed Mar 27 21:53:06 UTC 2024
>> 
>> Modified Files:
>> src/external/bsd/elftoolchain/dist/libdwarf: libdwarf_reloc.c
>> 
>> Log Message:
>> Don't try to compile the arch-specific relocation code if we don't have the
>> built-in headers (for tools)
>> 
>> To generate a diff of this commit:
>> cvs rdiff -u -r1.5 -r1.6 \
>>   src/external/bsd/elftoolchain/dist/libdwarf/libdwarf_reloc.c
>> 
>> Module Name: src
>> Committed By: christos
>> Date: Wed Mar 27 21:54:43 UTC 2024
>> 
>> Modified Files:
>> src/tools: Makefile.nbincludes
>> src/tools/elftoolchain/libdwarf: Makefile
>> 
>> Log Message:
>> Remove dependency to elfdefinitions.h, this is a mess, since it needs
>> ${TOOL_M4} which might not be available yet.
>> 
>> To generate a diff of this commit:
>> cvs rdiff -u -r1.7 -r1.8 src/tools/Makefile.nbincludes
>> cvs rdiff -u -r1.4 -r1.5 src/tools/elftoolchain/libdwarf/Makefile
>> 
>> 
>> Thank you.
>> 
>> -- 
>> Ryo ONODERA // r...@tetera.org
>> PGP fingerprint = 82A2 DC91 76E0 A10A 8ABB  FD1B F404 27FA C7D1 15F3
> 



Re: Strange sensor names for amdzentemp(4)

2024-03-20 Thread J. Hannken-Illjes



> On Mar 20, 2024, at 4:27 PM, Paul Goyette  wrote:
> 
> Oddly, I am seeing the following sensor info.  Note that the config
> doesn't even contain ``ccd''.  (Previous incarnations of this config
> _did_ have ccd, but it's been completely removed when I changed to
> use raidframe...)
> 
> # envstat -d amdzentemp0
>                         Current  CritMax  WarnMax  WarnMin  CritMin  Unit
>      cpu0 temperature:   55.125                                      degC
> cpu0 ccd0 temperature:   36.375                                      degC
> cpu0 ccd1 temperature:   37.500                                      degC
> #

The string originates from sys/arch/x86/pci/amdzentemp.c line 471.

In this context CCD is a synonym for Core Complex Die.

--
J. Hannken-Illjes - hann...@mailbox.org



Re: ethernet

2023-12-23 Thread J. Hannken-Illjes
> On 23. Dec 2023, at 16:17, xuser  wrote:
> 
> Does any one know how to have two ip addresses on one interface?

ifconfig IF inet ADDR alias
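
For illustration only, here is the same command with the placeholders filled in. The interface name wm0 and the 192.0.2.x address are assumptions, not taken from the question, and the helper merely prints the command so it can be reviewed before being run as root.

```shell
# Dry-run sketch: print the alias command instead of executing it.
# alias_cmd is a hypothetical helper; wm0 and 192.0.2.10 are placeholders.
alias_cmd() {
    # $1 = interface, $2 = additional IPv4 address, $3 = netmask
    printf 'ifconfig %s inet %s netmask %s alias\n' "$1" "$2" "$3"
}

alias_cmd wm0 192.0.2.10 255.255.255.0
```

The interface keeps its primary address; the alias is simply added next to it.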

--
J. Hannken-Illjes - hann...@mailbox.org


Re: Panic on dump over fss

2023-03-22 Thread J. Hannken-Illjes
> On 22. Mar 2023, at 15:03, César Catrián C.  wrote:
> 
>> /sbin/dump -0af - -x /var/tmp /home | ...
>> 
>> Dump will take and release the snapshot for you, see "man dump".
>> 
> 
> Got the same panic with the command:
> 
> # /sbin/dump -0af - -x /var/tmp / | /usr/bin/bzip2 -1 > /mnt/fs4/backups/current/root.dump.bz2
>  DUMP: Found /dev/rdk0 on / in /etc/fstab
> 
> [ 6216.0201020] panic: kernel diagnostic assertion "(req->req_bp->b_flags & B_PHYS) != 0" failed: file "/home/src/netbsd-current/src/sys/arch/xen/xen/xbd_xenbus.c", line 1374
> [ 6216.0201020] cpu0: Begin traceback...
> [ 6216.0201020] vpanic() at netbsd:vpanic+0x183
> [ 6216.0201020] kern_assert() at netbsd:kern_assert+0x4b
> [ 6216.0201020] xbd_diskstart() at netbsd:xbd_diskstart+0x7d4
> [ 6216.0201020] dk_start() at netbsd:dk_start+0xe0
> [ 6216.0201020] bdev_strategy() at netbsd:bdev_strategy+0x81
> [ 6216.0201020] spec_strategy() at netbsd:spec_strategy+0x6e
> [ 6216.0201020] VOP_STRATEGY() at netbsd:VOP_STRATEGY+0x3c
> [ 6216.0201020] dkstart() at netbsd:dkstart+0x184
> [ 6216.0201020] bdev_strategy() at netbsd:bdev_strategy+0x81
> [ 6216.0201020] fss_bs_thread() at netbsd:fss_bs_thread+0x32c
> [ 6216.0201020] cpu0: End traceback...
> 
> [ 6216.0201020] dumping to dev 168,1 (offset=8, size=1048576):
> [ 6216.0201020] dump device bad

Could you try the attached patch?

--
J. Hannken-Illjes - hann...@mailbox.org


fss.c.diff
Description: Binary data




Re: Panic on dump over fss

2023-03-22 Thread J. Hannken-Illjes
> On 22. Mar 2023, at 13:43, César Catrián C.  wrote:
> 
> Hi, please help with an issue while running dump(8) over a snapshotted 
> filesystem using fss.
> 
> The system is current 10.99.2 pvh Xen VM, compiled with mar 18 2023 sources, 
> running under NetBSD Xen, current 10.99.2 also, with feb 23 2023 sources.
> 
> Did other successful backups on two stable NetBSD 9.3 VMs.
> 
> The dump file is being written to an NFS share on another NetBSD machine.
> 
> These are the commands used:
> /usr/sbin/fssconfig fss0 /home /var/tmp
> /sbin/mount /dev/fss0 /mnt/fs3
> /sbin/dump -0af - /mnt/fs3 | /usr/bin/bzip2 > /mnt/fs4/backups/current/home.dump.bz2

There is no need to mount the snapshot, the simplest way to dump is:

/sbin/dump -0af - -x /var/tmp /home | ...

Dump will take and release the snapshot for you, see "man dump".
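
A sketch of the complete pipeline, reusing the bzip2 stage and the output path from the original commands. The helper name dump_cmd is made up, and it only prints the command line rather than running it.

```shell
# Dry-run sketch: print a dump pipeline that lets dump(8) manage the snapshot.
# dump_cmd is a hypothetical helper; the paths are those quoted in this thread.
dump_cmd() {
    # $1 = filesystem to dump, $2 = snapshot backing store (-x), $3 = output file
    printf '/sbin/dump -0af - -x %s %s | /usr/bin/bzip2 > %s\n' "$2" "$1" "$3"
}

dump_cmd /home /var/tmp /mnt/fs4/backups/current/home.dump.bz2
```

With -x there is no separate fssconfig or mount step to set up or clean up afterwards.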

--
J. Hannken-Illjes - hann...@mailbox.org




Re: blocklist puzzle

2023-02-19 Thread J. Hannken-Illjes
> On 18. Feb 2023, at 23:34, Patrick Welche  wrote:
> 
> 12 hours after rebooting
> 
> # npfctl rule blocklistd list
> block in final family inet4 proto tcp from 61.177.173.35/32 to any port 22 # id="1"
> #
> 
> contains a single block, yet /var/log/messages is full:
> 
> Feb 18 17:47:44 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 18:18:00 mail blocklistd[596]: released 171.225.184.179/32:22 after 172800 seconds
> Feb 18 18:18:07 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 18:35:18 mail blocklistd[596]: blocked 31.41.244.124/32:22 for 172800 seconds
> Feb 18 18:48:10 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> Feb 18 19:18:02 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 20:18:13 mail blocklistd[596]: blocked 195.226.194.142/32:22 for 172800 seconds
> Feb 18 20:47:46 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> Feb 18 21:17:48 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> Feb 18 21:47:55 mail blocklistd[596]: blocked 195.226.194.242/32:22 for 172800 seconds
> 
> 
> 
> If something were misconfigured, I would expect no hosts in the ruleset,
> rather than some (or one). How can this work partially?
> 
> extract of npf.conf:
> 
> group "external" on $ext_if {
>pass stateful out final all
> 
>ruleset "blocklistd"
> 
> ...

Looks like your ruleset "blocklistd" never fires as the rule above is
"final all".

--
J. Hannken-Illjes - hann...@mailbox.org




Re: 9.99.104: panic in tcp_shutdown_wrapper

2022-10-30 Thread J. Hannken-Illjes
> On 30. Oct 2022, at 06:52, Michael van Elst  wrote:
> 
> ozak...@netbsd.org (Ryota Ozaki) writes:
> 
>> I've committed a possible fix.  Could you try it?
> 
>> Thanks,
>> ozaki-r
> 
> 
> I just got a NULL pointer dereference in tcp_ctloutput where
> the previous check for inp == NULL is also missing.
> 
> [ 24837.756043] fp c0016794db70 tcp_ctloutput() at c02ec4b4 netbsd:tcp_ctloutput+0x94
> [ 24837.756043] fp c0016794dcc0 tcp_ctloutput_wrapper() at c02d2680 netbsd:tcp_ctloutput_wrapper+-0x31150
> [ 24837.756043] fp c0016794dcf0 sosetopt() at c0603cbc netbsd:sosetopt+0x78
> [ 24837.756043] fp c0016794ddb0 sys_setsockopt() at c060b0fc netbsd:sys_setsockopt+0x7c
> [ 24837.766041] fp c0016794de20 syscall() at c00b30fc netbsd:syscall+0x19c
> 
> That's:
> 
> int
> tcp_ctloutput(int op, struct socket *so, struct sockopt *sopt)
> {
> ...
>         s = splsoftnet();
>         inp = sotoinpcb(so);
> ...
>         }
>         tp = intotcpcb(inp); <-
>
>         switch (op) {

... and syzkaller (https://syzkaller.appspot.com/netbsd) has a
bunch of new TCP-related crashes starting about 2 days ago ...

--
J. Hannken-Illjes - hann...@mailbox.org




Re: Doc error - sysctl

2022-07-25 Thread J. Hannken-Illjes
> On 25. Jul 2022, at 16:30, Paul Goyette  wrote:
> 
> It seems that kern.maxvnodes is documented as "cannot be lowered"
> 
>   kern.maxvnodes (KERN_MAXVNODES)
>   The maximum number of vnodes available on the system.  This can
>   only be raised.
> 
> However, the kernel allows you to lower the value, and it helps if
> you want to flush file cache (free up active memory).

Yes, it can be lowered and will fail if you try to go below the number of
active vnodes. Please go ahead and fix the documentation.
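
A minimal sketch of lowering the limit from the command line. The starting value and the divisor are made-up numbers, and the helper (a hypothetical name) only prints the sysctl invocation so it can be checked before running it as root.

```shell
# Dry-run sketch: print a sysctl command that lowers kern.maxvnodes.
# lower_maxvnodes_cmd is a hypothetical helper; 200000 and 2 are example values.
lower_maxvnodes_cmd() {
    # $1 = current value of kern.maxvnodes, $2 = divisor to shrink it by
    printf 'sysctl -w kern.maxvnodes=%d\n' "$(( $1 / $2 ))"
}

lower_maxvnodes_cmd 200000 2
```

The current value can be read beforehand with "sysctl kern.maxvnodes"; as noted above, the write is expected to fail if the new value is below the number of active vnodes.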

> 
> 
> +--------------------+--------------------------+----------------------+
> | Paul Goyette       | PGP Key fingerprint:     | E-mail addresses:    |
> | (Retired)          | FA29 0E3B 35AF E8AE 6651 | p...@whooppee.com    |
> | Software Developer | 0786 F758 55DE 53BA 7731 | pgoye...@netbsd.org  |
> | & Network Engineer |                          | pgoyett...@gmail.com |
> +--------------------+--------------------------+----------------------+

--
J. Hannken-Illjes - hann...@mailbox.org




Re: pgdaemon high CPU consumption

2022-07-01 Thread J. Hannken-Illjes
> On 1. Jul 2022, at 07:55, Matthias Petermann  wrote:
> 
> Good day,
> 
> For some time I have noticed that on several of my systems with
> NetBSD/amd64 9.99.97/98, after longer usage, the kernel process pgdaemon
> completely claims a CPU core for itself, i.e. constantly consumes 100%.
> The affected systems do not have a shortage of RAM and the problem does
> not disappear even if all workloads are stopped, and thus no RAM is
> actually used by application processes.
> 
> I noticed this especially in connection with accesses to the ZFS set up
> on the respective machines - for example after checkout from the local
> CVS relic hosted on ZFS.
> 
> Is there already a known problem or what information would have to be
> collected to get to the bottom of this?
> 
> I currently have such a case online, so I would be happy to pull
> diagnostic information this evening/afternoon. At the moment all info I
> have is from top.
> 
> Normal view:
> 
> ```
>   PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
>     0 root     126    0     0K   34M CPU/0    102:45   100%   100% [system]
> ```
> 
> Thread view:
> 
> 
> ```
>   PID   LID USERNAME PRI STATE      TIME   WCPU    CPU NAME      COMMAND
>     0   173 root     126 CPU/1     96:57 98.93% 98.93% pgdaemon  [system]
> ```

Looks a lot like kern/55707: ZFS seems to trigger a lot of xcalls

Last action proposed was to back out the patch ...

--
J. Hannken-Illjes - hann...@mailbox.org




Re: kernel deadlock on fstchg with vnd

2022-05-30 Thread J. Hannken-Illjes
> On 29. May 2022, at 23:57, Manuel Bouyer  wrote:
> 
> On Sun, May 29, 2022 at 01:18:16PM +0200, J. Hannken-Illjes wrote:
>>> On 29. May 2022, at 08:30, Michael van Elst  wrote:
>>> 
>>> bou...@antioche.eu.org (Manuel Bouyer) writes:
>>> 
>>>> Hello,
>>>> do you have an idea on the problem in this thread:
>>>> http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html
>>> [...]
>>>> I can't reproduce this when using vnd from userland.
>>> 
>>> You can replicate it by addressing the block device with vnconfig.
>>> 
>>> A workaround would be to modify the Xen block script to select the
>>> raw device:
>>> 
>>> vnconfig /dev/r${disk}d $xparams >/dev/null; then
>>> 
>>> or just the disk name:
>>> 
>>> vnconfig ${disk} $xparams >/dev/null; then
>> 
>> Good catch, sys/dev/vnd.c has this:
>> 
>>  1751  static void
>>  1752  vndclear(struct vnd_softc *vnd, int myminor)
>>  1753  {
>>  1754          struct vnode *vp = vnd->sc_vp;
>>  1755          int fflags = FREAD;
>>  1756          int bmaj, cmaj, i, mn;
>>  1757          int s;
>>  1758
>>  1759  #ifdef DEBUG
>>  1760          if (vnddebug & VDB_FOLLOW)
>>  1761                  printf("vndclear(%p): vp %p\n", vnd, vp);
>>  1762  #endif
>>  1763          /* locate the major number */
>>  1764          bmaj = bdevsw_lookup_major(&vnd_bdevsw);
>>  1765          cmaj = cdevsw_lookup_major(&vnd_cdevsw);
>>  1766
>>  1767          /* Nuke the vnodes for any open instances */
>>  1768          for (i = 0; i < MAXPARTITIONS; i++) {
>>  1769                  mn = DISKMINOR(device_unit(vnd->sc_dev), i);
>>  1770                  vdevgone(bmaj, mn, mn, VBLK);
>>  1771                  if (mn != myminor) /* XXX avoid to kill own vnode */
>>  1772                          vdevgone(cmaj, mn, mn, VCHR);
>>  1773          }
>> 
>> The "skip myself" on lines 1771/1772 is responsible for this behaviour.
> 
> Yes and doing the same for block devices avoids the issue.
> But Taylor is reluctant to commit this hack.

And he is right.  It smells fishy to detach a (pseudo) device from
an open instance of itself, either with ioctl or close.

Why do we detach on last close -- isn't it sufficient to detach
either explicitly with drvctl(8) or on module unload?

The attached diff moves vdevgone() to vnd_detach() and no longer
detaches on last close -- comments?

--
J. Hannken-Illjes - hann...@mailbox.org



vnd.c.diff
Description: Binary data




Re: kernel deadlock on fstchg with vnd

2022-05-29 Thread J. Hannken-Illjes
> On 29. May 2022, at 08:30, Michael van Elst  wrote:
> 
> bou...@antioche.eu.org (Manuel Bouyer) writes:
> 
>> Hello,
>> do you have an idea on the problem in this thread:
>> http://mail-index.netbsd.org/port-xen/2022/05/27/msg010213.html
> [...]
>> I can't reproduce this when using vnd from userland.
> 
> You can replicate it by addressing the block device with vnconfig.
> 
> A workaround would be to modify the Xen block script to select the
> raw device:
> 
> vnconfig /dev/r${disk}d $xparams >/dev/null; then
> 
> or just the disk name:
> 
> vnconfig ${disk} $xparams >/dev/null; then

Good catch, sys/dev/vnd.c has this:

  1751  static void
  1752  vndclear(struct vnd_softc *vnd, int myminor)
  1753  {
  1754          struct vnode *vp = vnd->sc_vp;
  1755          int fflags = FREAD;
  1756          int bmaj, cmaj, i, mn;
  1757          int s;
  1758
  1759  #ifdef DEBUG
  1760          if (vnddebug & VDB_FOLLOW)
  1761                  printf("vndclear(%p): vp %p\n", vnd, vp);
  1762  #endif
  1763          /* locate the major number */
  1764          bmaj = bdevsw_lookup_major(&vnd_bdevsw);
  1765          cmaj = cdevsw_lookup_major(&vnd_cdevsw);
  1766
  1767          /* Nuke the vnodes for any open instances */
  1768          for (i = 0; i < MAXPARTITIONS; i++) {
  1769                  mn = DISKMINOR(device_unit(vnd->sc_dev), i);
  1770                  vdevgone(bmaj, mn, mn, VBLK);
  1771                  if (mn != myminor) /* XXX avoid to kill own vnode */
  1772                          vdevgone(cmaj, mn, mn, VCHR);
  1773          }

The "skip myself" on lines 1771/1772 is responsible for this behaviour.
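
To make the asymmetry concrete, here is a small userland model of that loop. It is not the kernel code: MAXPARTITIONS=16 and the simplified minor formula (unit * MAXPARTITIONS + partition) are assumptions for illustration, and revoked_minors is a made-up name.

```shell
# Simplified model of the loop in vndclear(); not the kernel implementation.
# Assumptions: MAXPARTITIONS is 16 and DISKMINOR(unit, part) is
# unit * MAXPARTITIONS + part.
MAXPARTITIONS=16

revoked_minors() {
    # $1 = vnd unit, $2 = minor of the instance issuing the clear ("myminor")
    unit=$1
    myminor=$2
    i=0
    while [ "$i" -lt "$MAXPARTITIONS" ]; do
        mn=$(( unit * MAXPARTITIONS + i ))
        # Block vnodes are revoked unconditionally, including our own.
        echo "blk $mn"
        # Character vnodes skip the caller's own minor.
        if [ "$mn" -ne "$myminor" ]; then
            echo "chr $mn"
        fi
        i=$(( i + 1 ))
    done
}

revoked_minors 0 3
```

With unit 0 and myminor 3 the output contains "blk 3" but no "chr 3": a caller that opened the character device is spared, while a caller that opened the block device has its own vnode revoked underneath it.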

--
J. Hannken-Illjes - hann...@mailbox.org




Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread J. Hannken-Illjes
> On 27. May 2022, at 16:24, Manuel Bouyer  wrote:
> 
> On Fri, May 27, 2022 at 02:52:55PM +0200, J. Hannken-Illjes wrote:
>>> On 27. May 2022, at 14:41, Matthias Petermann  wrote:
>>> 
>>> Hello Jürgen,
>>> 
>>> Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
>>>> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
>>>> should give even more details.
>>> 
>>> here is the stacktrace from the vnconfig process (the PID has changed since 
>>> I restarted):
>>> 
>>> https://www.petermann-it.de/tmp/p7.jpg
>> 
>> This is the thread currently suspending the root fs (vrevoke suspends it).
>> 
>> Looks like it is waiting for I/O to drain on the vnd device ...
>> 
>>> You can find the output of fstrans_dump here:
>>> 
>>> https://www.petermann-it.de/tmp/p8.jpg
>> 
>> The owner is confusing; it should be vnconfig from above.
> 
> I can reproduce it:

What is the recipe?

> db{0}> ps
> PID    LID  S CPU FLAGS STRUCT LWP *   NAME     WAIT
> 2419   2419 3   8     0 9000210b9280   tcsh     fstchg
> 2415   2415 3  11     0 90001f66f540   vnconfig fstchg
> 2416   2416 3  18     0 900020ea3200   dirname  fstchg
> 2417   2417 3  24     0 900020e6c700   sh       fstchg
> 2414   2414 3  12     0 90001f6d7a00   vnconfig specio
> [...]
> db{0}> tr/t 0t2415
> trace: pid 2415 lid 2415 at 0x90008ed3e980
> sleepq_block() at netbsd:sleepq_block+0x12c
> cv_wait() at netbsd:cv_wait+0x42
> fstrans_start() at netbsd:fstrans_start+0x193
> VOP_LOCK() at netbsd:VOP_LOCK+0x79
> vn_lock() at netbsd:vn_lock+0xae
> namei_tryemulroot() at netbsd:namei_tryemulroot+0x1024
> namei() at netbsd:namei+0x29
> vn_open() at netbsd:vn_open+0x133
> do_open() at netbsd:do_open+0xc3
> do_sys_openat() at netbsd:do_sys_openat+0x74
> sys_open() at netbsd:sys_open+0x24
> syscall() at netbsd:syscall+0x18c
> --- syscall (number 5) ---
> netbsd:syscall+0x18c:
> db{0}> tr/t 0t2414
> trace: pid 2414 lid 2414 at 0x90008c57e6c0
> sleepq_block() at netbsd:sleepq_block+0x12c
> cv_wait() at netbsd:cv_wait+0x42
> spec_io_drain() at netbsd:spec_io_drain+0x84
> spec_close() at netbsd:spec_close+0x1c6
> VOP_CLOSE() at netbsd:VOP_CLOSE+0x38
> spec_node_revoke() at netbsd:spec_node_revoke+0x14d
> vcache_reclaim() at netbsd:vcache_reclaim+0x4e7
> vgone() at netbsd:vgone+0xcd
> vrevoke() at netbsd:vrevoke+0xfa
> genfs_revoke() at netbsd:genfs_revoke+0x13
> VOP_REVOKE() at netbsd:VOP_REVOKE+0x35
> vdevgone() at netbsd:vdevgone+0x64
> vnddoclear.part.0() at netbsd:vnddoclear.part.0+0xaa
> vndioctl() at netbsd:vndioctl+0x78c
> bdev_ioctl() at netbsd:bdev_ioctl+0x91
> spec_ioctl() at netbsd:spec_ioctl+0xa5
> VOP_IOCTL() at netbsd:VOP_IOCTL+0x41
> vn_ioctl() at netbsd:vn_ioctl+0xb3
> sys_ioctl() at netbsd:sys_ioctl+0x555
> syscall() at netbsd:syscall+0x18c
> --- syscall (number 54) ---
> netbsd:syscall+0x18c:
> db{0}> call fstrans_dump
> Fstrans locks by lwp:
> [ 5691.3454404] 2414.241 (/) shared 2 cow 0 alias 0
> [ 5691.3454404] Fstrans state by mount:
> [ 5691.3454404] /owner 0x90001f6d7a00 state suspended
> 
> In the ps output there is also:
> 0 2324 3   3   200   90001fe43340       vnd0 vndbp
> db{0}> tr/a 90001fe43340
> trace: pid 0 lid 2324 at 0x90008c806df0
> sleepq_block() at netbsd:sleepq_block+0x12c
> vndthread() at netbsd:vndthread+0x78c
> 
> So it looks like vnconfig waits for the vnd I/O to drain, but the vnd thread
> is idle.

No -- the name is confusing, it waits for spec_io_enter/exit to drain.

Better ask Taylor ...

--
J. Hannken-Illjes - hann...@mailbox.org




Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread J. Hannken-Illjes
> On 27. May 2022, at 14:41, Matthias Petermann  wrote:
> 
> Hello Jürgen,
> 
> Am 27.05.2022 um 14:14 schrieb J. Hannken-Illjes:
>> Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
>> should give even more details.
> 
> here is the stacktrace from the vnconfig process (the PID has changed since I 
> restarted):
> 
> https://www.petermann-it.de/tmp/p7.jpg

This is the thread currently suspending the root fs (vrevoke suspends it).

Looks like it is waiting for I/O to drain on the vnd device ...

> You can find the output of fstrans_dump here:
> 
> https://www.petermann-it.de/tmp/p8.jpg

The owner is confusing; it should be vnconfig from above.

> I hope this helps a bit in troubleshooting.
> 
> Kind regards
> Matthias

--
J. Hannken-Illjes - hann...@mailbox.org




Re: NetBSD Xen guest freezes system + vif MAC address confusion (NetBSD 9.99.97 / Xen 4.15.2)

2022-05-27 Thread J. Hannken-Illjes
> On 27. May 2022, at 14:06, Matthias Petermann  wrote:
> 
> Hi Manuel,
> 
> Am 27.05.2022 um 12:14 schrieb Manuel Bouyer:
>>> Paginated processes list:
>>> 
>>> https://www.petermann-it.de/tmp/p1.jpg
>>> https://www.petermann-it.de/tmp/p2.jpg
>>> https://www.petermann-it.de/tmp/p3.jpg
>> several processes in fstchg wait, a stack trace of these processes
>> (tr/t 0t or tr/a 0x would show these) would help.
>> 
>> So it looks like a deadlock in the filesystem. What is your storage
>> configuration ?
>> 
> 
> Thanks for your advice - I did another series of screenshots and prepared the 
> relevant information here:
> 
>https://www.petermann-it.de/tmp/p6.png
> 
> My storage configuration this time is nothing out of the ordinary:
> 
> ```
> wd0 (GPT)
> |
> '-- dk0 (NAME:root, FFSv2 with log, contains the root filesystem)
> '---dk1 (NAME:swap)
> '---dk2 (NAME:data, FFSv2 with log, contains VND-Images)
>  |
>  '-- net.img (16 GB   sparse file image)
>  '-- net-export.img  (500 GB  sparse file image)
> ```
> 
> Since you bring up the deadlock / filesystem assumption - I did an
> additional test right away. My original test case uses both CPU cores in
> Dom0. The modified test boots Dom0 with "dom0_max_vcpus=1 dom0_vcpus_pin"
> so that only one core is available. With only one core in the Dom0 at
> least the VM is instantiated (meaning the "xl create" command comes back
> as expected), and the Dom0 stays responsive for a little while. In
> contrast to the original test, I was now able to perform "xl list" and
> did see the VM. However, once I tried "xl console" I only got a fragment:
> 
> ```
> ganymed$ doas xl console net
> [   1.000] cpu_rng: rdrand
> [   1.000] entropy: ready
> [   1.000] Copyright (c) 1996, 1997, 1998, 1999,
> ```
> 
> At the "1999," the Dom0 became frozen, again.
> 
> Kind regards
> Matthias
> 

Stack trace of thread vnconfig (1239) and from ddb "call fstrans_dump"
should give even more details.

--
J. Hannken-Illjes - hann...@mailbox.org




Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE

2022-04-23 Thread J. Hannken-Illjes
> On 23. Apr 2022, at 14:45, Takahiro Kambe  wrote:
> 
> Hi,
> 
> In message <029c86d6-e0d2-4c98-8798-4bdc39ba0...@mailbox.org>
>   on Sat, 23 Apr 2022 10:22:13 +0200,
>   "J. Hannken-Illjes"  wrote:
>>> On 23. Apr 2022, at 10:15, J. Hannken-Illjes  wrote:
>> 
>>> Please try the attached diff (with mount option "discard").
>> 
>> ... and remove the "#define TRIMDEBUG" from the top of ffs_alloc.c first ...
> Thanks!!
> 
> Now, it works fine with "discard" option.

Committed.

--
J. Hannken-Illjes - hann...@mailbox.org




Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE

2022-04-23 Thread J. Hannken-Illjes
> On 23. Apr 2022, at 10:15, J. Hannken-Illjes  wrote:

> Please try the attached diff (with mount option "discard").

... and remove the "#define TRIMDEBUG" from the top of ffs_alloc.c first ...

--
J. Hannken-Illjes - hann...@mailbox.org




Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE

2022-04-23 Thread J. Hannken-Illjes
> On 23. Apr 2022, at 05:17, Takahiro Kambe  wrote:
> 
> Hi,
> 
> In message <15bebcc1-4756-46ad-a424-e5232065b...@mailbox.org>
>   on Fri, 22 Apr 2022 19:39:35 +0200,
>   "J. Hannken-Illjes"  wrote:
>>> #5  0x80e69314 in VOP_FDISCARD (vp=0x879fb253da40,
>>>   pos=, len=)
>>>   at /usr/src/sys/kern/vnode_if.c:845
>>> #6  0x80e69314 in VOP_FDISCARD (vp=0x879fb2e99cc0,
>>>   pos=pos@entry=5843857408, len=len@entry=2048)
>>>   at /usr/src/sys/kern/vnode_if.c:845
>> 
>> This one is different from the previous stack trace, two VOP_FDISCARD().

> # cat /etc/fstab
> /dev/dk0        /efi    msdos   rw,noauto       0 0
> /dev/dk4        /       ffs     rw,discard      1 1

Ok, you have wedges that introduce another indirection.

Please try the attached diff (with mount option "discard").

--
J. Hannken-Illjes - hann...@mailbox.org



fdiscard.diff
Description: Binary data




Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE

2022-04-22 Thread J. Hannken-Illjes
> On 22. Apr 2022, at 16:25, Takahiro Kambe  wrote:
> 
> Hi,
> 
> In message 
>   on Fri, 22 Apr 2022 09:44:48 +0200,
>   "J. Hannken-Illjes"  wrote:
>>>> Thanks - I can confirm that a kernel from yesterday doesn't have the
>>>> issue any longer.
>>> 
>>> I still have panic() on ThinkPad E495.
>>> 
>>> panic: kernel diagnostic assertion "VOP_ISLOCKED(vp) == LK_EXCLUSIVE" failed: file "/usr/src/sys/miscfs/specfs/spec_vnops.c", line 1252
>>> cpu3: Begin traceback...
>>> vpanic() at netbsd:vpanic+0x183
>>> kern_assert() at netbsd:kern_assert+0x4b
>>> spec_fdiscard() at netbsd:spec_fdiscard+0xaa
>>> VOP_FDISCARD() at netbsd:VOP_FDISCARD+0x3d
>>> ffs_discardcb() at netbsd:ffs_discardcb+0x2e
>>> workqueue_worker() at netbsd:workqueue_worker+0xd7
>> 
>> 
>> Is the attached diff sufficient to fix your problem?
> Sadly, no luck.
> 
> (gdb) where
> #0  0x802261f5 in cpu_reboot (howto=howto@entry=260,
>     bootstr=bootstr@entry=0x0) at /usr/src/sys/arch/amd64/amd64/machdep.c:720
> #1  0x80da5414 in kern_reboot (howto=howto@entry=260,
>     bootstr=bootstr@entry=0x0) at /usr/src/sys/kern/kern_reboot.c:73
> #2  0x80deb32d in vpanic (
>     fmt=0x813938f8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d ",
>     ap=ap@entry=0xb80249389e48) at /usr/src/sys/kern/subr_prf.c:293
> #3  0x80fad18f in kern_assert (
>     fmt=fmt@entry=0x813938f8 "kernel %sassertion \"%s\" failed: file \"%s\", line %d ")
>     at /usr/src/sys/lib/libkern/kern_assert.c:51
> #4  0x80e76595 in spec_fdiscard (v=0xb80249389ee0)
>     at /usr/src/sys/miscfs/specfs/spec_vnops.c:1252
> #5  0x80e69314 in VOP_FDISCARD (vp=0x879fb253da40,
>     pos=, len=)
>     at /usr/src/sys/kern/vnode_if.c:845
> #6  0x80e69314 in VOP_FDISCARD (vp=0x879fb2e99cc0,
>     pos=pos@entry=5843857408, len=len@entry=2048)
>     at /usr/src/sys/kern/vnode_if.c:845

This one is different from the previous stack trace, two VOP_FDISCARD().

Could you print the vnodes and mounts from frame #6 and #5,
( print *vp and print *vp->v_mount ) please?

What is mounted ( /etc/fstab and mount )?

> #7  0x80cdf921 in ffs_discardcb (wk=0x879fb4199840,
>     arg=0x879fb3053f40) at /usr/src/sys/ufs/ffs/ffs_alloc.c:1656
> #8  0x80df4fd6 in workqueue_runlist (list=0x879fb2c34ee8,
>     list=0x879fb2c34ee8, wq=0x879fb2c34e80)
>     at /usr/src/sys/kern/subr_workqueue.c:105
> #9  workqueue_worker (cookie=0x879fb2c34e80)
>     at /usr/src/sys/kern/subr_workqueue.c:135
> #10 0x8020b327 in lwp_trampoline ()
> #11 0x in ?? ()

--
J. Hannken-Illjes - hann...@mailbox.org


signature.asc
Description: Message signed with OpenPGP


Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE

2022-04-22 Thread J. Hannken-Illjes
> On 21. Apr 2022, at 16:57, Takahiro Kambe  wrote:
> 
> In message 
>   on Mon, 18 Apr 2022 09:56:55 +0200,
>   Thomas Klausner  wrote:
>>> Already committed by Taylor R Campbell as sequencer.c Rev. 1.79
>>> on 2022/04/16 11:13:10.
>> 
>> Thanks - I can confirm that a kernel from yesterday doesn't have the
>> issue any longer.
> 
> I still have panic() on ThinkPad E495.
> 
> panic: kernel diagnostic assertion "VOP_ISLOCKED(vp) == LK_EXCLUSIVE" failed: 
> file "/usr/src/sys/miscfs/specfs/spec_vnops.c", line 1252
> cpu3: Begin traceback...
> vpanic() at netbsd:vpanic+0x183
> kern_assert() at netbsd:kern_assert+0x4b
> spec_fdiscard() at netbsd:spec_fdiscard+0xaa
> VOP_FDISCARD() at netbsd:VOP_FDISCARD+0x3d
> ffs_discardcb() at netbsd:ffs_discardcb+0x2e
> workqueue_worker() at netbsd:workqueue_worker+0xd7


Is the attached diff sufficient to fix your problem?

--
J. Hannken-Illjes - hann...@mailbox.org




ffs_alloc.c.diff
Description: Binary data



Re: reproducible kernel crash with quota

2022-04-21 Thread J. Hannken-Illjes
> On 21. Apr 2022, at 00:36, 6b...@6bone.informatik.uni-leipzig.de wrote:
> 
> On Wed, 20 Apr 2022, J. Hannken-Illjes wrote:
> 
>> Date: Wed, 20 Apr 2022 22:19:30 +0200
>> From: J. Hannken-Illjes 
>> To: 6b...@6bone.informatik.uni-leipzig.de
>> Cc: current-users@netbsd.org, Manuel Bouyer 
>> Subject: [Extern] Re: reproducible kernel crash with quota
>>> On 20. Apr 2022, at 22:10, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>> 
>>> On Tue, 19 Apr 2022, J. Hannken-Illjes wrote:
>>> 
>>>> Date: Tue, 19 Apr 2022 11:07:48 +0200
>>>> From: J. Hannken-Illjes 
>>>> To: 6b...@6bone.informatik.uni-leipzig.de
>>>> Cc: current-users@netbsd.org, Manuel Bouyer 
>>>> Subject: [Extern] Re: reproducible kernel crash with quota
>>>>> On 19. Apr 2022, at 08:38, 6b...@6bone.informatik.uni-leipzig.de wrote:

>>>> Please try again with both diffs applied.
>>> 
>>> I tested with both patches. If I enable only userquota it seems to work. If 
>>> I also enable groupquota, the kernel crashes:
>>> 
>>> output:
>>> 
>>> /etc/rc.d/quota restart
>>> Checking quotas:quotacheck: creating quota file //quota.group
>> 
>> You have root (/) with quota?  What exactly do you have in /etc/fstab?
> 
> cat /etc/fstab
> # NetBSD /etc/fstab
> # See /usr/share/examples/fstab/ for more examples.
> NAME=179d5ca2-7f26-476b-b544-823bd1849816   /   ffs 
> rw,userquota,groupquota  1 1

I'm confused.  With "/dev/ld0a / ffs rw,userquota,groupquota 1 1"
in /etc/fstab and both patches applied I get:

$ /etc/rc.d/quota restart
Checking quotas: done.

No line "creating quota file ..."

--
J. Hannken-Illjes - hann...@mailbox.org



Re: reproducible kernel crash with quota

2022-04-20 Thread J. Hannken-Illjes
> On 20. Apr 2022, at 22:10, 6b...@6bone.informatik.uni-leipzig.de wrote:
> 
> On Tue, 19 Apr 2022, J. Hannken-Illjes wrote:
> 
>> Date: Tue, 19 Apr 2022 11:07:48 +0200
>> From: J. Hannken-Illjes 
>> To: 6b...@6bone.informatik.uni-leipzig.de
>> Cc: current-users@netbsd.org, Manuel Bouyer 
>> Subject: [Extern] Re: reproducible kernel crash with quota
>>> On 19. Apr 2022, at 08:38, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>> 
>>> On Thu, 14 Apr 2022, J. Hannken-Illjes wrote:
>>> 
>>>> Date: Thu, 14 Apr 2022 13:09:02 +0200
>>>> From: J. Hannken-Illjes 
>>>> To: 6b...@6bone.informatik.uni-leipzig.de
>>>> Cc: current-users@netbsd.org, Manuel Bouyer 
>>>> Subject: [Extern] Re: reproducible kernel crash with quota
>>>>> On 12. Apr 2022, at 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> since I already have some open bugs with reproducible kernel crashes, I'm 
>>>>> only writing this to the mailing list.
>>>>> 
>>>>> how to reproduce the crash: /etc/rc.d/quota restart
>>>>> 
>>>>> dmesg:
>>>>> 
>>>>> [   412.047595] panic: kernel diagnostic assertion 
>>>>> "dq->dq_ump->um_quotas[dq->dq_type] != vp" failed: file 
>>>>> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978
>>>>> [   412.047595] cpu8: Begin traceback...
>>>>> [   412.047595] vpanic() at netbsd:vpanic+0x156
>>>>> [   412.057595] kern_assert() at netbsd:kern_assert+0x4b
>>>>> [   412.057595] dqflush() at netbsd:dqflush+0x92
>>>>> [   412.057595] quota1_handle_cmd_quotaoff() at 
>>>>> netbsd:quota1_handle_cmd_quotaoff+0x120
>>>>> [   412.057595] ufs_quotactl() at netbsd:ufs_quotactl+0x3d
>>>>> [   412.057595] VFS_QUOTACTL() at netbsd:VFS_QUOTACTL+0x22
>>>>> [   412.057595] vfs_quotactl_quotaoff() at 
>>>>> netbsd:vfs_quotactl_quotaoff+0x1b
>>>>> [   412.057595] do_sys_quotactl() at netbsd:do_sys_quotactl+0xf1
>>>>> [   412.067595] sys___quotactl() at netbsd:sys___quotactl+0x2e
>>>>> [   412.067595] syscall() at netbsd:syscall+0x196
>>>>> [   412.067595] --- syscall (number 473) ---
>>>>> [   412.067595] netbsd:syscall+0x196:
>>>>> [   412.067595] cpu8: End traceback...
>>>>> 
>>>>> [   412.067595] dumping to dev 168,1 (offset=8, size=33425953):
>>>>> [   412.067595] dump
>>>>> 
>>>>> 
>>>>> (gdb) target kvm netbsd.1.core
>>>> 
>>>> 
>>>> I'm quite sure you have a /etc/fstab with "userquota,groupquota", yes?
>>>> 
>>>> with gdb:
>>>> 
>>>> frame 4 (dqflush())
>>>> print dq->dq_ump->um_quotas[0]
>>>> print dq->dq_ump->um_quotas[1]
>>>> 
>>>> gives the same vnode address for both fields, yes?
>>>> 
>>>> If this is the case the attached diff should help, since 2012-01-30
>>>> group quota got enabled on the user quota file.
>>>> 
>>>> As a workaround you could try to name the quota files in /etc/fstab
>>>> like "groupquota=XXX/quota.group".
>>> 
>>> You are right. I use groupquota and userquota in fstab. I tested the patch. 
>>> With patch there is no crash. But the /etc/rc.d/quota restart leads to the 
>>> blocking of the file system. You can only turn off the server. This also 
>>> happens when I only use userquota in the fstab.
>> 
>> Sorry, forgot the second diff (now attached) that prevents looping
>> when taking the quota off on a modified file system.
>> 
>> Please try again with both diffs applied.
> 
> I tested with both patches. If I enable only userquota it seems to work. If 
> I also enable groupquota, the kernel crashes:
> 
> output:
> 
> /etc/rc.d/quota restart
> Checking quotas:quotacheck: creating quota file //quota.group

You have root (/) with quota?  What exactly do you have in /etc/fstab?

> done.
> 
> -> crash

Are "dq->dq_ump->um_quotas[0]" and "dq->dq_ump->um_quotas[1]" now different?

> [   448.325252] panic: kernel diagnostic assertion 
> "dq->dq_ump->um_quotas[dq->dq_type] != vp" failed: file 
> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978
> [   448.325252] cpu1: Begin traceback...

Re: reproducible kernel crash with quota

2022-04-19 Thread J. Hannken-Illjes
> On 19. Apr 2022, at 08:38, 6b...@6bone.informatik.uni-leipzig.de wrote:
> 
> On Thu, 14 Apr 2022, J. Hannken-Illjes wrote:
> 
>> Date: Thu, 14 Apr 2022 13:09:02 +0200
>> From: J. Hannken-Illjes 
>> To: 6b...@6bone.informatik.uni-leipzig.de
>> Cc: current-users@netbsd.org, Manuel Bouyer 
>> Subject: [Extern] Re: reproducible kernel crash with quota
>>> On 12. Apr 2022, at 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote:
>>> 
>>> Hello,
>>> 
>>> since I already have some open bugs with reproducible kernel crashes, I'm 
>>> only writing this to the mailing list.
>>> 
>>> how to reproduce the crash: /etc/rc.d/quota restart
>>> 
>>> dmesg:
>>> 
>>> [   412.047595] panic: kernel diagnostic assertion 
>>> "dq->dq_ump->um_quotas[dq->dq_type] != vp" failed: file 
>>> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978
>>> [   412.047595] cpu8: Begin traceback...
>>> [   412.047595] vpanic() at netbsd:vpanic+0x156
>>> [   412.057595] kern_assert() at netbsd:kern_assert+0x4b
>>> [   412.057595] dqflush() at netbsd:dqflush+0x92
>>> [   412.057595] quota1_handle_cmd_quotaoff() at 
>>> netbsd:quota1_handle_cmd_quotaoff+0x120
>>> [   412.057595] ufs_quotactl() at netbsd:ufs_quotactl+0x3d
>>> [   412.057595] VFS_QUOTACTL() at netbsd:VFS_QUOTACTL+0x22
>>> [   412.057595] vfs_quotactl_quotaoff() at netbsd:vfs_quotactl_quotaoff+0x1b
>>> [   412.057595] do_sys_quotactl() at netbsd:do_sys_quotactl+0xf1
>>> [   412.067595] sys___quotactl() at netbsd:sys___quotactl+0x2e
>>> [   412.067595] syscall() at netbsd:syscall+0x196
>>> [   412.067595] --- syscall (number 473) ---
>>> [   412.067595] netbsd:syscall+0x196:
>>> [   412.067595] cpu8: End traceback...
>>> 
>>> [   412.067595] dumping to dev 168,1 (offset=8, size=33425953):
>>> [   412.067595] dump
>>> 
>>> 
>>> (gdb) target kvm netbsd.1.core
>> 
>> 
>> I'm quite sure you have a /etc/fstab with "userquota,groupquota", yes?
>> 
>> with gdb:
>> 
>> frame 4 (dqflush())
>> print dq->dq_ump->um_quotas[0]
>> print dq->dq_ump->um_quotas[1]
>> 
>> gives the same vnode address for both fields, yes?
>> 
>> If this is the case the attached diff should help, since 2012-01-30
>> group quota got enabled on the user quota file.
>> 
>> As a workaround you could try to name the quota files in /etc/fstab
>> like "groupquota=XXX/quota.group".
> 
> You are right. I use groupquota and userquota in fstab. I tested the patch. 
> With patch there is no crash. But the /etc/rc.d/quota restart leads to the 
> blocking of the file system. You can only turn off the server. This also 
> happens when I only use userquota in the fstab.

Sorry, forgot the second diff (now attached) that prevents looping
when taking the quota off on a modified file system.

Please try again with both diffs applied.

> Thank you for your efforts
> 
> Regards
> Uwe
> 
> 
>> 
>>> 
>>> Maybe someone can fix the problem.
>>> 
>>> 
>>> Thank you for your efforts
>>> 
>>> 
>>> Regards
>>> Uwe
>> 
>> --
>> J. Hannken-Illjes - hann...@mailbox.org
>> 

--
J. Hannken-Illjes - hann...@mailbox.org



003_quota_flag.diff
Description: Binary data



Re: panic: kernel diagnostic assertion VOP_ISLOCKET(vp) == LK_EXCLUSIVE

2022-04-16 Thread J. Hannken-Illjes
> On 16. Apr 2022, at 17:27, Tobias Nygren  wrote:
> 
> On Sat, 16 Apr 2022 16:51:31 +0200
> Thomas Klausner  wrote:
> 
>> panic: kernel diagnostic assertion "VOP_ISLOCKED(vp) == LK_EXCLUSIVE" 
>> failed: file "/usr/src/sys/miscfs/specfs/spec_vnops.c", line 1555
>> cpu1: Begin traceback...
>> vpanic()
>> kern_assert()
>> spec_close() at netbsd:spec_close+0x2fc
>> VOP_CLOSE() at netbsd:vop_close+0x42
>> sequenceropen() at netbsd:sequenceropen+0x359
> 
> "cat /dev/sequencer" as a regular user is enough to trigger this. In
> the midiseq_open() error path it is trying to VOP_CLOSE without the
> vnode lock held. Maybe this patch helps. (Someone with filesystem
> clue please sanity check this.)
> 
> --- sys/dev/sequencer.c   31 Mar 2022 19:30:15 -  1.76
> +++ sys/dev/sequencer.c   16 Apr 2022 15:23:54 -
> @@ -1452,8 +1452,9 @@ midiseq_open(int unit, int flags)
>       if ((mi.props & MIDI_PROP_CAN_INPUT) == 0)
>               flags &= ~FREAD;
>       if ((flags & (FREAD|FWRITE)) == 0) {
> +             vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>               VOP_CLOSE(vp, oflags, kauth_cred_get());
> -             vrele(vp);
> +             vput(vp);
>               return NULL;
>       }

Already committed by Taylor R Campbell as sequencer.c Rev. 1.79
on 2022/04/16 11:13:10.

--
J. Hannken-Illjes - hann...@mailbox.org



Re: reproducible kernel crash with quota

2022-04-14 Thread J. Hannken-Illjes
> On 12. Apr 2022, at 08:52, 6b...@6bone.informatik.uni-leipzig.de wrote:
> 
> Hello,
> 
> since I already have some open bugs with reproducible kernel crashes, I'm 
> only writing this to the mailing list.
> 
> how to reproduce the crash: /etc/rc.d/quota restart
> 
> dmesg:
> 
> [   412.047595] panic: kernel diagnostic assertion 
> "dq->dq_ump->um_quotas[dq->dq_type] != vp" failed: file 
> "/usr/src/sys/ufs/ufs/ufs_quota.c", line 978
> [   412.047595] cpu8: Begin traceback...
> [   412.047595] vpanic() at netbsd:vpanic+0x156
> [   412.057595] kern_assert() at netbsd:kern_assert+0x4b
> [   412.057595] dqflush() at netbsd:dqflush+0x92
> [   412.057595] quota1_handle_cmd_quotaoff() at 
> netbsd:quota1_handle_cmd_quotaoff+0x120
> [   412.057595] ufs_quotactl() at netbsd:ufs_quotactl+0x3d
> [   412.057595] VFS_QUOTACTL() at netbsd:VFS_QUOTACTL+0x22
> [   412.057595] vfs_quotactl_quotaoff() at netbsd:vfs_quotactl_quotaoff+0x1b
> [   412.057595] do_sys_quotactl() at netbsd:do_sys_quotactl+0xf1
> [   412.067595] sys___quotactl() at netbsd:sys___quotactl+0x2e
> [   412.067595] syscall() at netbsd:syscall+0x196
> [   412.067595] --- syscall (number 473) ---
> [   412.067595] netbsd:syscall+0x196:
> [   412.067595] cpu8: End traceback...
> 
> [   412.067595] dumping to dev 168,1 (offset=8, size=33425953):
> [   412.067595] dump
> 
> 
> (gdb) target kvm netbsd.1.core


I'm quite sure you have a /etc/fstab with "userquota,groupquota", yes?

with gdb:

frame 4 (dqflush())
print dq->dq_ump->um_quotas[0]
print dq->dq_ump->um_quotas[1]

gives the same vnode address for both fields, yes?

If this is the case the attached diff should help, since 2012-01-30
group quota got enabled on the user quota file.

As a workaround you could try to name the quota files in /etc/fstab
like "groupquota=XXX/quota.group".
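A minimal sketch for spotting fstab entries that the workaround above would apply to: it lists mount points whose option field enables both userquota and groupquota without naming the quota files. The helper name quota_files_unnamed is hypothetical, and the sketch assumes one fstab entry per line; it only flags candidates and does not inspect the kernel state.

```shell
# Hypothetical helper: read fstab-formatted lines on stdin and print the
# mount point of every entry that has both "userquota" and "groupquota"
# as bare options (no "=file" suffix).
quota_files_unnamed() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }   # skip comments and short lines
        {
            n = split($4, opt, ",")
            user = 0; group = 0
            for (i = 1; i <= n; i++) {
                if (opt[i] == "userquota")  user = 1
                if (opt[i] == "groupquota") group = 1
            }
            if (user && group) print $2
        }
    '
}
```

Typical use would be "quota_files_unnamed < /etc/fstab"; an entry written as "groupquota=XXX/quota.group" is not reported because the option no longer matches the bare name.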

> 
> Maybe someone can fix the problem.
> 
> 
> Thank you for your efforts
> 
> 
> Regards
> Uwe

--
J. Hannken-Illjes - hann...@mailbox.org


quota_oldfiles.c.diff
Description: Binary data



Re: Unprivileged build can't build custom kernels

2021-12-31 Thread J. Hannken-Illjes
> On 31. Dec 2021, at 12:37, John D. Baker  wrote:
> 
> The recent changes to build "netbsd-${CONF}.debug" seem not to work
> for unprivileged builds when building custom kernels as it wants to
> install the file owned by "root":
> 
> [...]
> #  link  DAVID/netbsd
> /r0/build/current/tools/amd64/bin/sparc--netbsdelf-ld -Map netbsd.map --cref 
> -n -T netbsd.ldscript -Ttext F0004000 -e start -X -X -o netbsd 
> ${SYSTEM_OBJ:[@]:Nswapnetbsd.o} ${EXTRA_OBJ} vers.o swapnetbsd.o
> NetBSD 9.99.93 (DAVID) #376: Fri Dec 31 02:44:31 CST 2021
>    text    data     bss     dec     hex filename
> 4808082  118584  147752 5074418  4d6df2 netbsd
> + mv -f netbsd netbsd.gdb
> + /r0/build/current/tools/amd64/bin/sparc--netbsdelf-objcopy 
> --only-keep-debug netbsd.gdb netbsd-DAVID.debug
> + /r0/build/current/tools/amd64/bin/sparc--netbsdelf-objcopy --strip-debug -p 
> -R .gnu_debuglink --add-gnu-debuglink=netbsd-DAVID.debug netbsd.gdb netbsd
> + chmod 755 netbsd netbsd.gdb netbsd-DAVID.debug
> --- /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug ---
> #   install  /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug
> /r0/build/current/tools/amd64/bin/sparc--netbsdelf-install  -c -p -r -o root 
> -g bin -m 444 netbsd-DAVID.debug 
> /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug
> sparc--netbsdelf-install: 
> /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug.inst.fO9ANt:
>  chown/chgrp: Operation not permitted
> 
> *** Failed target: 
> /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug
> *** Failed commands:
>${_MKTARGET_INSTALL}
>=> @echo '#  ' "install " 
> /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug
>${INSTALL_FILE} -o root -g bin -m 444 ${.ALLSRC} ${.TARGET}
>=> /r0/build/current/tools/amd64/bin/sparc--netbsdelf-install  -c -p 
> -r -o root -g bin -m 444 netbsd-DAVID.debug 
> /r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug
> *** [/r0/build/current/DEST/sparc/usr/libdata/debug/netbsd-DAVID.debug] Error 
> code 1
> 
> nbmake: stopped in /r0/build/current/obj/sparc/sys/arch/sparc/compile/DAVID
> 1 error
> 
> nbmake: stopped in /r0/build/current/obj/sparc/sys/arch/sparc/compile/DAVID
> 
> ERROR: Failed to make debuginstall in 
> "/r0/build/current/obj/sparc/sys/arch/sparc/compile/DAVID"
> *** BUILD ABORTED ***

For me the attached diff works.  It skips the install
outside ${NETBSDSRCDIR}.

--
J. Hannken-Illjes - hann...@mailbox.org



Makefile.kern.inc.diff
Description: Binary data



Re: null mounts seem to lose directories?

2021-08-22 Thread J. Hannken-Illjes
> On 22. Aug 2021, at 09:26, nia  wrote:
> 
> I have various null mounts on top of a tmpfs:
> 
> $ df -h
> ...
> tmpfs                                  87G   1.0G    86G   1% /sandbox/nb9-i386-trunk/chroot/1
> /sandbox/nb9-i386-trunk/data/bulklog  237G    79G   158G  33% /sandbox/nb9-i386-trunk/chroot/1/data/bulklog
> ...
> 
> Directories are not being synchronized properly across the null mount:
> 
> procyon$ ls /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/
> build.log checksum.log  configure.log depends.log   pre-clean.log work.log
> procyon$ ls /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/
> ls: /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/: No such file or 
> directory
> 
> The source of the null mount is a ZFS dataset:

> 
> # zfs list
> ...
> tank/sandbox/nb9-i386-trunk/data   124G   158G  79.2G  /sandbox/nb9-i386-trunk/data
> ...

Using the attached script I see no problems.  What are you
doing between the "mount -t tmpfs", "mount -t null" and this "ls"?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig

SCRIPT:

mkdir -p /sandbox/nb9-i386-trunk/data
zpool create -m /sandbox/nb9-i386-trunk/data tank /dev/md0

mkdir -p /sandbox/nb9-i386-trunk/chroot/1
mount -t tmpfs tmpfs /sandbox/nb9-i386-trunk/chroot/1

mkdir -p /sandbox/nb9-i386-trunk/data/bulklog
mkdir -p /sandbox/nb9-i386-trunk/chroot/1/data/bulklog
mount -t null /sandbox/nb9-i386-trunk/data/bulklog \
/sandbox/nb9-i386-trunk/chroot/1/data/bulklog

df -h | egrep '(^File|/sand.*chr)'
zfs list

#mkdir -p /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/
mkdir -p /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/
touch /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/build.log.t
touch /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/build.log.z

ls /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4/
ls /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4/

OUTPUT:

Filesystem                             Size   Used  Avail %Cap Mounted on
tmpfs                                   67G   4.0K    67G   0% /sandbox/nb9-i386-trunk/chroot/1
/sandbox/nb9-i386-trunk/data/bulklog   1.8G    23K   1.8G   0% /sandbox/nb9-i386-trunk/chroot/1/data/bulklog
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank   304K  1.81G    23K  /sandbox/nb9-i386-trunk/data
build.log.t  build.log.z
build.log.t  build.log.z
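
The two "ls" calls at the end of the script can be turned into a pass/fail check. This is a minimal sketch under the assumption that comparing directory listings is enough to show the divergence described in the report; same_listing is a hypothetical helper, not an existing tool.

```shell
# Hypothetical helper: succeed (0) if both directories list the same
# names, fail (1) if the listings differ, and return 2 if either
# directory cannot be listed at all.
same_listing() {
    a=$(ls -A "$1" 2>/dev/null) || return 2
    b=$(ls -A "$2" 2>/dev/null) || return 2
    [ "$a" = "$b" ]
}
```

For the layout above it could be called as "same_listing /sandbox/nb9-i386-trunk/chroot/1/data/bulklog/rust-1.52.1nb4 /sandbox/nb9-i386-trunk/data/bulklog/rust-1.52.1nb4"; a return value of 2 would correspond to the "No such file or directory" case in the report.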



Re: 9.99.86 HEAD

2021-07-01 Thread J. Hannken-Illjes
> On 1. Jul 2021, at 21:04, David Holland  wrote:
> 
> On Thu, Jul 01, 2021 at 07:54:33PM +0200, J. Hannken-Illjes wrote:
>>  lookup_fastforward -> lookup_parsepath -> VOP_PARSEPATH -> ... -> 
>> fstrans_start
> 
> Bleh. I had a feeling we were going to end up regretting that
> fastforward code. :-|
> 
>> According to vnode_if.src VOP_PARSEPATH(dvp...) should take a locked vnode
>> but here this lock is missing. So either
>> 
>> - make sure the vnode is locked so fstrans_start will no longer block.
>> 
>> or
>> 
>> - add FSTRANS=NO to vop_parsepath, file kern/vnode_if.src and allow unlocked 
>> vnodes:
>> 
>> vop_parsepath {
>> +   FSTRANS=NO
>>IN struct vnode *dvp;
>> 
>> David?
> 
> I thought the vnode was locked readonly in the fastforward path. Did I
> misread? Or is that not good enough?

Nope, the fastforward path takes namecache locks only.

> Anyway, I think it's probably ok to change vop_parsepath to not
> require locked vnodes at all. The only parsepath operation that does
> anything other than string ops is rumpfs's, and it takes etfs_lock to
> look in some tables that etfs_lock covers. Unless that's going to
> interact badly with fstrans without the vnode lock covering it (seems
> unlikely, but IDK) there shouldn't be a problem.

This is ok, the vnode is referenced and comparing it to rootvnode is ok.

> However, except in the fastforward code the vnode will be locked. So I
> think it should be "= = =" in vnode_if.src. If you also need to add
> FSTRANS=NO, that should be fine too.

Setting "= = =" is ok, but it is only a comment.
You also need "FSTRANS=NO" to prevent VOP_PARSEPATH() from taking fstrans
and deadlocking.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



Re: 9.99.86 HEAD

2021-07-01 Thread J. Hannken-Illjes
> On 1. Jul 2021, at 18:24, Martin Husemann  wrote:
> 
> I did not trust macppc / lockdebug so reproduced it on evbarm.
> 
> Unfortunately nearly identical (not making any sense to me) output again...

I'm quite sure one thread does something like

  lookup_fastforward -> lookup_parsepath -> VOP_PARSEPATH -> ... -> 
fstrans_start

where dvp->v_mount is currently unmounting and therefore suspended.
If lookup_fastforward holds a lock on vi_nc_lock we have a deadlock.

According to vnode_if.src VOP_PARSEPATH(dvp...) should take a locked vnode
but here this lock is missing. So either

- make sure the vnode is locked so fstrans_start will no longer block.

or

- add FSTRANS=NO to vop_parsepath, file kern/vnode_if.src and allow unlocked 
vnodes:

 vop_parsepath {
+   FSTRANS=NO
IN struct vnode *dvp;

David?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



Re: dump/restore out of range inode

2021-06-06 Thread J. Hannken-Illjes
> On 6. Jun 2021, at 10:10, Patrick Welche  wrote:
> 
> On Sat, Jun 05, 2021 at 06:45:24PM +0200, J. Hannken-Illjes wrote:
>> Patrick,
>> 
>> please try the attached diff so the "spcl.c_addr" test
>> no longer runs off the spcl record.
>> 
>> "blks" is used for multi-tape checkpointing and examining
>> TS_INODE/TS_ADDR records should be sufficient as they are
>> the only records that support holes in data.
> 
> Thanks! With your patch, the dump | restore has been happily
> running for about 12 hours now.

Ok, will commit and request pullup next week.

> In your previous email you mention:
> 
>> This trace makes no sense, bitmaps (CLRI and BITS) don't have holes
>> and therefore ignore the "c_addr" array.  I have no idea how dumping
>> a bitmap ends in the hole processing of flushtape().
> 
> Is it worth investigating further while I have the reproducer?

No, this is an error that manifests on file systems with
many inodes and therefore did not show up before.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



Re: dump/restore out of range inode

2021-06-05 Thread J. Hannken-Illjes
Patrick,

please try the attached diff so the "spcl.c_addr" test
no longer runs off the spcl record.

"blks" is used for multi-tape checkpointing and examining
TS_INODE/TS_ADDR records should be sufficient as they are
the only records that support holes in data.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig


tape.c.diff
Description: Binary data



Re: dump/restore out of range inode

2021-06-05 Thread J. Hannken-Illjes
> On 5. Jun 2021, at 12:31, Patrick Welche  wrote:
> 
> On Sat, Jun 05, 2021 at 10:03:21AM -, Michael van Elst wrote:
>> pr...@cam.ac.uk (Patrick Welche) writes:
>> 
>>> How can gdb not see a spcl anywhere?
>> 
>> /usr/include/protocols/dumprestore.h:#define spcl u_spcl.s_spcl
>> 
>> spcl is just a define that got resolved by the compiler.
> 
> ach... here it is
> (gdb) print u_spcl.s_spcl
> 
> $2 = {c_type = 6, c_old_date = 0, c_old_ddate = 0, c_volume = 1,
>  c_old_tapea = 0, c_inumber = 397083647, c_magic = 424935705,
>  c_checksum = 1906085926, __c_ino = {__uc_dinode = {di_mode = 0,
>  di_nlink = 0, di_oldids = {0, 0}, di_size = 0, di_atime = 0,
>  di_atimensec = 0, di_mtime = 0, di_mtimensec = 0, di_ctime = 0,
>  di_ctimensec = 0, di_db = {0 }, di_ib = {0, 0, 0},
>  di_flags = 0, di_blocks = 0, di_gen = 0, di_uid = 0, di_gid = 0,
>  di_modrev = 0}, __uc_ino = {__uc_mode = 0, __uc_spare1 = {0, 0, 0},
>  __uc_size = 0, __uc_old_atime = 0, __uc_atimensec = 0,
>  __uc_old_mtime = 0, __uc_mtimensec = 0, __uc_spare2 = {0, 0},
>  __uc_rdev = 0, __uc_birthtimensec = 0, __uc_birthtime = 0,
>  __uc_atime = 0, __uc_mtime = 0, __uc_spare4 = {0, 0, 0, 0, 0, 0, 0},
>  __uc_file_flags = 0, __uc_spare5 = {0, 0}, __uc_uid = 0, __uc_gid = 0,
>  __uc_spare6 = {0, 0}}}, c_count = 48473,
>  c_addr = '\000' ,
>  c_label = "none", '\000' , c_level = 0,
>  c_filesys = "/store/backup", '\000' ,
>  c_dev = "/dev/rdk18", '\000' ,
>  c_host = "quantz", '\000' , c_flags = 2,
>  c_old_firstrec = 0, c_date = 1622887657, c_ddate = 0, c_tapea = 10,
>  c_firstrec = 0, c_spare = {0 }}
> (gdb) bt
> #0  flushtape () at /usr/src/sbin/dump/tape.c:333
> #1  0x0020763e in writerec (dp=dp@entry=0x7f7ff3a01380 "",
>isspcl=isspcl@entry=0) at /usr/src/sbin/dump/tape.c:168
> #2  0x00208e49 in dumpmap (map=, type=type@entry=6,
>ino=ino@entry=397083647) at /usr/src/sbin/dump/traverse.c:716
> #3  0x0020b355 in main (argc=1, argv=0x7f7fe7e8)
>at /usr/src/sbin/dump/main.c:646
> (gdb) list
> 328 }
> 329
> 330 blks = 0;
> 331 if (iswap32(spcl.c_type) != TS_END) {
> 332 for (i = 0; i < iswap32(spcl.c_count); i++)
> 333 if (spcl.c_addr[i] != 0)
> 334 blks++;
> 335 }
> 336 slp->count = lastspclrec + blks + 1 - iswap64(spcl.c_tapea);
> 337 slp->tapea = iswap64(spcl.c_tapea);
> (gdb) print i
> $6 = <optimized out>
> (gdb) print u_spcl.s_spcl.c_count
> $7 = 48473
> (gdb) whatis u_spcl.s_spcl.c_addr
> type = char [512]
> 
> so guess optimized_out i >> 512
> 
> c_type==6 = TS_CLRI map of inodes deleted since last dump
> 
> (a bit odd:
> (gdb) print needswap
> $11 = 0
> (gdb) print iswap32(u_spcl.s_spcl.c_count)
> $10 = 1505558528
> )
> 
> Still puzzled...

This trace makes no sense, bitmaps (CLRI and BITS) don't have holes
and therefore ignore the "c_addr" array.  I have no idea how dumping
a bitmap ends in the hole processing of flushtape().
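
The value gdb printed for iswap32(u_spcl.s_spcl.c_count) in the quoted session, 1505558528, is exactly what a 32-bit byte swap of 48473 yields. A minimal sketch of that arithmetic follows; swap32 is a hypothetical helper written for illustration and is not part of dump.

```shell
# Hypothetical helper: reverse the byte order of a 32-bit unsigned value.
# 48473 is 0x0000bd59, which becomes 0x59bd0000 after the swap.
swap32() {
    echo $(( (($1 & 0xff) << 24) | (($1 & 0xff00) << 8) | (($1 >> 8) & 0xff00) | (($1 >> 24) & 0xff) ))
}

swap32 48473    # prints 1505558528
```

Either way, c_count = 48473 is far larger than the 512 entries of c_addr shown by "whatis", so the loop at tape.c:333 reads past the end of the array.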

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



Re: zfs howto

2021-02-14 Thread J. Hannken-Illjes
> On 14. Feb 2021, at 02:55, Brad Spencer  wrote:
> 
> Chavdar Ivanov  writes:
> 
> [snip]
> 
>>> I am not sure of the complete context of the statement, but I do this
>>> all of the time with normal NetBSD NFS against a ZFS fileset.
>>> 
>>> build% cat /etc/exports
>>> /usr/installed_src/PKGSRC_2018Q4 -alldirs -maproot=root 
>>> anotherbuild.system.eldar.org
>>> 
>>> build% zfs list /usr/installed_src/PKGSRC_2018Q4
>>> NAME   USED  AVAIL  REFER  MOUNTPOINT
>>> tank/installed_src/PKGSRC_2018Q4   414M   250G   414M  
>>> /usr/installed_src/PKGSRC_2018Q4
>>> 
>>> 
>>> These are DOMUs running NetBSD 9.0_STABLE from around September.  I have
>>> not tried this with -current, but there are no crashes for me with 9.x.
> 
> [snip]
> 
>> 
>> I got it ---
>> 
>> With the following entry in /etc/exports:
>> 
>> /tank/t1 -maproot=0:10 -network 192.168.0/24
>> 
>> the NFS server crashes when /tank/t1 is a ZFS file system.
>> 
>> With the following one:
>> 
>> /tank/t1 -maproot=root -network 192.168.0/24
>> 
>> it works fine.
>> 
>> Mind you, '-maproot=0:10' is the first example from 'man exports' ...

The trigger is '-maproot' with group(s), first bug is mountd leaving
'cr_gid' as -2 and setting the first group list member to 10 in this case.

Second bug is ZFS setting illegal group id -2 aka 4294967294 to GID_NOBODY
with id -2.  Later this illegal id leads to null pointer dereference
in zfs_log_create() at zfs_log.c:297 "lr->lr_gid = fuidp->z_fuid_group"
where fuidp is NULL.

With the attached diff the ZFS bug gets fixed and your export works.
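
The "-2 aka 4294967294" above is the same bit pattern read as a signed or as an unsigned 32-bit value. A minimal sketch of the conversion, assuming the shell evaluates arithmetic with at least 64-bit integers; as_u32 is a hypothetical helper:

```shell
# Hypothetical helper: show a (possibly negative) id as the unsigned
# 32-bit number it becomes when stored in a 32-bit unsigned field.
as_u32() {
    echo $(( $1 & 0xffffffff ))
}

as_u32 -2    # prints 4294967294
```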

> Glad to see that it isn't totally broken.  I am by no means an expert in
> the ZFS code, and I am not in a position to take a lot of time looking
> at it right now, but if the trace back in the PR is correct, it makes it
> almost totally through the mkdir call and crashes in the log create step
> after the directory node is created.  I am trying not to speculate too
> much here, but the code may fail to handle the group in the exports
> line.
> 
> 
> 
> 
> 
> 
> --
> Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig


zfs_context.h.diff
Description: Binary data



Re: Automated report: NetBSD-current/i386 test failure

2020-06-16 Thread J. Hannken-Illjes
> On 16. Jun 2020, at 12:42, NetBSD Test Fixture  wrote:
> 
> This is an automatically generated notice of new failures of the
> NetBSD test suite.
> 
> The newly failing test cases are:
> 
>fs/vfs/t_full:nfs_fillfs

[snip]

>2020.06.14.23.38.25 kamil src/sys/rump/include/rump/rump.h,v 1.72

This commit seems to be the cause ...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



Re: panic on zpool create

2020-01-07 Thread J. Hannken-Illjes
> On 6. Jan 2020, at 15:04, David Brownlee  wrote:
> 
> I've just tried to create a zfs pool and had a panic (tried twice, was
> gifted a kernel core each time). This is on latest NetBSD-9.0_RC1 from
> nyftp (Thu Jan  2 10:02:26 UTC 2020)
> 
> The commands were "zpool create angus_media wd0" and "zpool create -f
> angus_media wd0 wd1"
> 
> Moved disks to another machine (On which I'd used zfs before), on
> which the latter command completed fine (modulo wd1 and wd2 as
> different device numbers).
> 
> Disks were 6TB with any labels/gpt blanked.
> 
> Bit of a puzzler...
> 
> crash reports:
> 
> _KERNEL_OPT_NARCNET() at 0
> ?() at a9826c8f4000
> vpanic() at vpanic+0x169
> snprintf() at snprintf
> startlwp() at startlwp
> calltrap() at calltrap+0x11
> fstrans_start() at fstrans_start+0x64
> VOP_LOCK() at VOP_LOCK+0x52
> vn_lock() at vn_lock+0x11
> secmodel_extensions_system_cb() at secmodel_extensions_system_cb+0x70
> kauth_authorize_action() at kauth_authorize_action+0xaa
> kauth_authorize_system() at kauth_authorize_system+0x28
> zfs_mount() at zfs_mount+0xcf
> VFS_MOUNT() at VFS_MOUNT+0x4d
> mount_domount() at mount_domount+0xdf
> do_sys_mount() at do_sys_mount+0x580
> sys___mount50() at sys___mount50+0x33
> syscall() at syscall+0x157
> --- syscall (number 410) ---
> 7d6364687aba:

For some reason locking the directory we want to mount on crashes.

Anything special with the root on this machine?

Does the directory (/angus_media I suppose) exist?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)


Re: Tar extract behaviour changed

2019-10-22 Thread J. Hannken-Illjes


> On 22. Oct 2019, at 07:26, Martin Husemann  wrote:
> 
> The current state silently breaks existing valid setups ("valid" of course
> in my view, as I personally ran into one that I created myself).

It breaks chrooted services, I got non-working "unbound" and "nsd".

Suppose this will hurt a bunch of installations when they
go from -8 to -9.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



Tar extract behaviour changed

2019-10-21 Thread J. Hannken-Illjes
Somewhere between NetBSD-8 and NetBSD-9 "tar" changed its behaviour
when it has to extract a directory and the path exists as a symlink.

The attached script on -8 gives:

NetBSD 8.0_STABLE

== Initial:
total 8
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 realtarget
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 target

== Change to symlink:
total 4
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 realtarget
lrwxr-xr-x  1 hannken  staff   10 Oct 21 11:47 target -> realtarget

== After extract:
total 4
drwxr-xr-x  2 hannken  staff  512 Oct 21 11:47 realtarget
lrwxr-xr-x  1 hannken  staff   10 Oct 21 11:47 target -> realtarget

On -9 it gives:

NetBSD 9.0_BETA

== Initial:
total 4
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 realtarget
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 target

== Change to symlink:
total 2
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 realtarget
lrwxr-xr-x  1 root  wheel   10 Oct 21 11:48 target -> realtarget

== After extract:
total 4
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 realtarget
drwxr-xr-x  2 root  wheel  512 Oct 21 11:48 target

Here "target" was changed from symlink to directory.

On NetBSD-9 extracting the "base" set replaces the symlink "/etc/unbound"
with a directory and therefore unbound fails to start.

Is this a bug in "tar" or is there a switch to get the old behaviour back?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

#! /bin/sh

uname -sr
mkdir junk
mkdir junk/target junk/realtarget
printf "\n== Initial:\n" ; ls -l junk
tar -c -f junk.tar junk
rmdir junk/target
ln -s realtarget junk/target
printf "\n== Change to symlink:\n" ; ls -l junk
tar -x -f junk.tar
printf "\n== After extract:\n" ; ls -l junk

rm -rf junk junk.tar


Re: i386 9.99.17 build fails for NET4501 kernel

2019-10-17 Thread J. Hannken-Illjes
Any chance we can build x86 kernels without DIAGNOSTIC again?

Does it need a PR?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



> On 15. Oct 2019, at 17:56, John D. Baker  wrote:
> 
> On Tue, 15 Oct 2019, Ryo ONODERA wrote:
> 
>> If the problem is the compiler bug, the patch like the following
>> may be effective.
> 
>> [snip]
> 
> I applied a similar change for i386 and with it the stock NET4501
> config (w/"options DIAGNOSTIC" commented out) builds successfully.
> 
> +Index: sys/arch/i386/conf/Makefile.i386
> +===================================================================
> +RCS file: /cvsroot/src/sys/arch/i386/conf/Makefile.i386,v
> +retrieving revision 1.194
> +diff -u -p -r1.194 Makefile.i386
> +--- sys/arch/i386/conf/Makefile.i386	22 Sep 2018 12:24:02 -0000	1.194
> ++++ sys/arch/i386/conf/Makefile.i386	15 Oct 2019 15:52:05 -0000
> +@@ -44,6 +44,7 @@ CFLAGS+=  -mno-mmx -mno-sse -mno-avx
> + CFLAGS+=   -mindirect-branch=thunk
> + CFLAGS+=   -mindirect-branch-register
> + .endif
> ++COPTS.vm_machdep.c+=   -Wno-error=clobbered
> +
> + ##
> + ## (3) libkern and compat
> 
> 
> 
> 
> --
> |/"\ John D. Baker, KN5UKS   NetBSD Darwin/MacOS X
> |\ / jdbaker[snail]consolidated[flyspeck]net  OpenBSDFreeBSD
> | X  No HTML/proprietary data in email.   BSD just sits there and works!
> |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645





Re: VFS panic

2019-02-21 Thread J. Hannken-Illjes


> On 21. Feb 2019, at 00:18, Robert Swindells  wrote:
> 
> 
> I'm getting a panic at startup on an evbearmv7hf-el system:
> 
> ...
> Starting sshd.
> Starting inetd.
> Starting cron.
> Wed Feb 20 21:23:46 GMT 2019
> panic: kernel diagnostic assertion "mp != dead_rootmount" failed: file 
> "../../../../kern/vfs_trans.c", line 680 
> cpu1: Begin traceback...
> 0x9ac8dd54: netbsd:db_panic+0x14
> 0x9ac8dd6c: netbsd:vpanic+0x194
> 0x9ac8dd84: netbsd:__aeabi_uldivmod
> 0x9ac8ddbc: netbsd:vfs_suspend+0x1b8
> 0x9ac8dddc: netbsd:vrevoke_suspend_next+0x3c
> 0x9ac8de14: netbsd:vrevoke+0xc4
> 0x9ac8de24: netbsd:genfs_revoke+0x20
> 0x9ac8de4c: netbsd:VOP_REVOKE+0x40
> 0x9ac8df14: netbsd:dorevoke+0x94
> 0x9ac8df34: netbsd:sys_revoke+0x44
> 0x9ac8dfac: netbsd:syscall+0x12c
> cpu1: End traceback...
> Undefined instruction 0xe7ff in kernel at 0x80023534 (LR 0x80265358 SP 0x9ac8dd58)
> Stopped in pid 621.1 (getty) at netbsd:cpu_Debugger:und 0xe7ff
> db{1}>
> 
> This is a -current kernel from sources updated about an hour ago,
> userland is a couple of days old.

Please try again with sys/kern/vfs_trans.c Rev. 1.55

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Kernel crash trying to use union mount

2019-01-19 Thread J. Hannken-Illjes
> On 18. Jan 2019, at 14:13, Tom Ivar Helbekkmo  wrote:
> 
> I just had a really weird crash on a NetBSD/amd64-current system,
> running a kernel 8.99.30 from January 2nd.  Here's what happened:
> 
> I was going to experiment with a rather large set of changes to the
> local copy of the source tree, which I'd want to revert afterwards, so I
> created a directory on another file system, and mounted it on top of
> /usr/src with mount_union.  I then copied a 10MiB diff into /usr/src/.
> That went well - the file was visible in /usr/src/, and I observed that
> it was correctly stored in the auxiliary directory, as expected.
> 
> Then I tried reading the file from /usr/src/, and the system immediately
> crashed, and dumped core, with the panic:
> 
> kernel diagnostic assertion "fli->fli_trans_cnt > 0" failed: file 
> "/usr/src/sys/kern/vfs_trans.c", line 451

The VOP_UNLOCK() doesn't match the corresponding vn_lock().



> fstrans_done() at fstrans_done+0x126
> VOP_UNLOCK() at VOP_UNLOCK+0x5b
> vput() at vput+0x11
> union_lookup1() at union_lookup1+0xfe

This is

	while (dvp != udvp && (dvp->v_type == VDIR) &&
	       (mp = dvp->v_mountedhere)) {
		if (vfs_busy(mp))
			continue;
		vput(dvp);

which looks wrong.  Please show your mounted file systems.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig




Re: panic when removing a file in current

2018-07-19 Thread J. Hannken-Illjes
On Thu, Jul 19, 2018 at 01:08:22PM +0200, Johnny Billquist wrote:
> Hmm. That means I need to update user land, which can be a bit scary since it 
> can make a rollback really hard.
> And there is also a chicken and egg thing here. Installing a new user land 
> can potentially mean removing files, which will trigger the panic.
> 
> Is it really motivated with that panic? The system is running without issues 
> on that same file system and NetBSD 7.

You could backport this change to -7 fsck_ffs, the patch (attached) is small.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: pass1.c
===================================================================
RCS file: /cvsroot/src/sbin/fsck_ffs/pass1.c,v
retrieving revision 1.57
retrieving revision 1.58
diff -p -u -r1.57 -r1.58
--- pass1.c	8 Feb 2017 16:11:40 -0000	1.57
+++ pass1.c	13 Feb 2018 11:20:08 -0000	1.58
@@ -253,8 +253,9 @@ checkinode(ino_t inumber, struct inodesc
 		    (memcmp(dp->dp1.di_db, ufs1_zino.di_db,
 			UFS_NDADDR * sizeof(int32_t)) ||
 		    memcmp(dp->dp1.di_ib, ufs1_zino.di_ib,
-			UFS_NIADDR * sizeof(int32_t)))) ||
-		    mode || size) {
+			UFS_NIADDR * sizeof(int32_t))))
+		    ||
+		    mode || size || DIP(dp, blocks)) {
 			pfatal("PARTIALLY ALLOCATED INODE I=%llu",
 			    (unsigned long long)inumber);
 			if (reply("CLEAR") == 1) {
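
The condition this patch extends can be sketched in isolation as follows.
This is only an illustration under assumptions: "struct mini_dinode" and
"partially_allocated()" are made-up names, the inode layout is heavily
simplified, and the sketch assumes the surrounding fsck_ffs code only gets
here for inodes whose mode is zero.  Only the constants 12 and 3 mirror
UFS_NDADDR and UFS_NIADDR.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NDADDR 12	/* direct block pointers, like UFS_NDADDR */
#define NIADDR 3	/* indirect block pointers, like UFS_NIADDR */

/* Hypothetical, simplified stand-in for a UFS1 on-disk inode. */
struct mini_dinode {
	uint16_t mode;		/* 0 means "not allocated" */
	uint64_t size;
	uint32_t blocks;	/* blocks accounted to this inode */
	int32_t db[NDADDR];	/* direct block pointers */
	int32_t ib[NIADDR];	/* indirect block pointers */
};

/*
 * An unallocated inode (mode == 0) must be completely zero.  Any leftover
 * block pointer, size or block count means it was only partially cleared.
 */
bool
partially_allocated(const struct mini_dinode *dp)
{
	static const struct mini_dinode zino;	/* all-zero reference inode */

	if (dp->mode != 0)
		return false;	/* allocated inode, not the case sketched here */
	return memcmp(dp->db, zino.db, sizeof(zino.db)) != 0 ||
	    memcmp(dp->ib, zino.ib, sizeof(zino.ib)) != 0 ||
	    dp->size != 0 ||
	    dp->blocks != 0;	/* the extra test corresponding to DIP(dp, blocks) */
}
```

With the extra test, an inode with mode 0 and size 0 but a stale block
count is reported instead of being passed on to the kernel.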


Re: panic when removing a file in current

2018-07-19 Thread J. Hannken-Illjes



> On 19. Jul 2018, at 03:54, Johnny Billquist  wrote:
> 
> Anyone seen this, or know what it's about?

Great, it took 6 months to trigger my assertion ...

This panic probably means the file system contains unallocated inodes that
were only partially zeroed.

Please run "fsck -f" on this file system and look for messages
like "PARTIALLY ALLOCATED INODE".

> On NetBSD/vax, with 8.99.22 from today.
> 
> Removing any file that has disk blocks allocated to it:
> 
> [ 653.3285523] ufs_inactive: unlinked ino 50313 on "/home" has non zero size 
> 0 or blocks 1ac0 with allerror 0
> [ 653.3484633] panic: ufs_inactive: dirty filesystem?
> [ 653.3788284] cpu0: Begin traceback...
> [ 653.3984724] panic: ufs_inactive: dirty filesystem?
> [ 653.4090004] Stack traceback :
> [ 653.4231115]   Process is executing in user space.
> [ 653.4286045] cpu0: End traceback...
> Stopped in pid 39.1 (rm) at netbsd:vpanic+0xc5: pushl   $0
> 
> 
> If a file is small enough to have all the data in the inode itself, rm 
> survives fine.

We never hold file data in inodes, only short symlinks.

> 
>  Johnny
> 
> -- 
> Johnny Billquist  || "I'm on a bus
>  ||  on a psychedelic trip
> email: b...@softjar.se     ||  Reading murder books
> pdp is alive! ||  tryin' to stay hip" - B. Idol

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: netbsd-8 hang on tstile

2018-03-07 Thread J. Hannken-Illjes

> On 6. Mar 2018, at 23:33, Manuel Bouyer <bou...@antioche.eu.org> wrote:
> 
> Hello
> on an up-to-date netbsd-8 Xen3 i386PAE kernel I see hangs on tstile.
> Hung processes show the same pattern, they sleep in fstrans_start():

> 
> This is reproducible, restarting my automatic test script hangs the same
> way. This is plain ffs, no wapbl.
> 
> Any idea ?

Please enter DDB and "call fstrans_dump(0)" to see which thread blocks
the transition (it will have "... shared N ..." with N > 0).

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: FFS panic

2018-01-15 Thread J. Hannken-Illjes
Manuel,

does it help to run clri from fsdb?

We definitely need an assertion of "blocks == 0" on inode deletion.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

> On 15. Jan 2018, at 09:11, Manuel Bouyer <bou...@antioche.eu.org> wrote:
> 
> Hello,
> I get a recurring panic on a netbsd-8 host:
> ffs_newvnode: ino=4 on /dsk/l1: gen 35a8ffda/35a8ffda has non zero blocks 
> af80 or size 0
> panic: ffs_newvnode: dirty filesystem?
> 
> I remember something about a ffsv2 bug, but this filesystem is ffsv1.
> fsck doesn't seem to fix it.
> 
> Any idea ?
> 
> -- 
> Manuel Bouyer <bou...@antioche.eu.org>
> NetBSD: 26 years of experience will always make the difference
> --


Re: Fixing swap1_stop

2017-08-23 Thread J. Hannken-Illjes

> On 19. Aug 2017, at 14:20, Christos Zoulas <chris...@zoulas.com> wrote:
> 
> On Aug 19,  1:04pm, hann...@eis.cs.tu-bs.de ("J. Hannken-Illjes") wrote:
> -- Subject: Re: Fixing swap1_stop
> 
> | A long time ago forced unmounts tried to change open block device nodes
> | to anonymous (not attached to a file system) nodes.  This was racy and
> | has been removed.
> | 
> | With the recent changes to the VFS subsystem it should be possible to
> | bring this behaviour back and instead of destroying open device nodes
> | a forced unmount would detach them from the file system and keep them
> | active.
> | 
> | Did you mean something like this?
> 
> Yes exactly that.

Committed and pullup to -8 requested:

src/sys/kern/vfs_vnode.c r1.97, r1.98
src/sys/miscfs/deadfs/dead_vfsops.c r1.8
src/sys/kern/vfs_mount.c r1.67
src/sys/sys/vnode_impl.h r1.16

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Fixing swap1_stop

2017-08-19 Thread J. Hannken-Illjes

> On 18. Aug 2017, at 10:16, Robert Elz <k...@munnari.oz.au> wrote:
> 
> After thinking about this (that is, the original problem here,
> not the mount changes, which are useful for other reasons - the
> reason I did the implementation I showed is that I have a very
> similar need in some of my scripts, where I have just been "knowing"
> that I never have weird chars, like spaces, in any of the mount point
> names, up to now.)
> 
> Anyway, after thinking about it for a bit, I am not convinced that
> any fix in the rc scripts is the best way to solve the underlying
> problem - and at best it means a bunch of messy config that users
> would need to maintain.
> 
> Might it not be better instead to fix it in the kernel, make it
> so that if umount -f encounters a device node from the filesystem being
> unmounted, it simply marks the vnode so it is known its home filesystem
> has vanished (pointer to mount-point = NULL most probably -- so no more
> attempts to update the times, which is all that the underlying filesys
> is ever used for after a device is opened) and otherwise leave it alone?
> 
> (A regular umount, without -f, would return EBUSY as normal.)
> 
> That way the device keeps on working, until it is closed, when the vnode
> can just be trashed, and in the meantime, the filesystem it came from can
> be cleanly unmounted (or as cleanly as -f ever permits.)  In the general case
> we really want the umount to happen, as otherwise its parent filesys, which
> might also need unmounting, cannot be unmounted (a tmpfs with devices mounted
> on a tmpfs that has none, but is occupying swap, for example.)
> 
> Wouldn't this solve the original problem in a much simpler and better way?

A long time ago forced unmounts tried to change open block device nodes
to anonymous (not attached to a file system) nodes.  This was racy and
has been removed.

With the recent changes to the VFS subsystem it should be possible to
bring this behaviour back and instead of destroying open device nodes
a forced unmount would detach them from the file system and keep them
active.

Did you mean something like this?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: ffs_vnops.c changes break kernels w/o WAPBL

2017-03-01 Thread J. Hannken-Illjes

> On 1 Mar 2017, at 22:19, John D. Baker <jdba...@mylinuxisp.com> wrote:
> 
> Recent changes to "sys/ufs/ffs/ffs_vnops.c" break building kernels which
> don't include "options WAPBL" (e.g. NET4501).
> 
> The complaint is about "struct mount *mp" being set but not used.
> 
> In the above-mentioned file in "ffs_spec_fsync(void *v)", "struct mount *mp"
> is used only in a block of code guarded with "#ifdef WAPBL".
> 
> The following patch adds the same guard to the declaration and setting
> of "struct mount *mp".  This allows the NET4501 kernel to build.
> 
> +Index: sys/ufs/ffs/ffs_vnops.c
> +===================================================================
> +RCS file: /cvsroot/src/sys/ufs/ffs/ffs_vnops.c,v
> +retrieving revision 1.126
> +diff -u -p -r1.126 ffs_vnops.c
> +--- sys/ufs/ffs/ffs_vnops.c	1 Mar 2017 10:42:45 -0000	1.126
> ++++ sys/ufs/ffs/ffs_vnops.c	1 Mar 2017 20:10:33 -0000
> +@@ -283,12 +283,16 @@ ffs_spec_fsync(void *v)
> + } */ *ap = v;
> + int error, flags, uflags;
> + struct vnode *vp;
> ++#ifdef WAPBL
> + struct mount *mp;
> ++#endif /* WAPBL */
> + 
> + flags = ap->a_flags;
> + uflags = UPDATE_CLOSE | ((flags & FSYNC_WAIT) ? UPDATE_WAIT : 0);
> + vp = ap->a_vp;
> ++#ifdef WAPBL
> + mp = vp->v_mount;
> ++#endif /* WAPBL */
> + 
> + error = spec_fsync(v);
> + if (error)
> 
> -- 
> |/"\ John D. Baker, KN5UKS   NetBSD Darwin/MacOS X
> |\ / jdbaker[snail]mylinuxisp[flyspeck]comOpenBSD        FreeBSD
> | X  No HTML/proprietary data in email.   BSD just sits there and works!
> |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645
> 

Committed with a slight modification -- thanks!
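
For readers who have not run into this warning: the pattern from the quoted
patch can be sketched with a toy function.  This is an assumption-laden
illustration and not the committed change; "mount_model", "vnode_model" and
"fsync_model()" are made-up names.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct mount and struct vnode. */
struct mount_model { int wapbl_active; };
struct vnode_model { struct mount_model *v_mount; };

/* Returns 1 if a journal flush would be needed, 0 otherwise. */
int
fsync_model(struct vnode_model *vp)
{
#ifdef WAPBL
	struct mount_model *mp;	/* declared only where it is also used */

	mp = vp->v_mount;
	if (mp != NULL && mp->wapbl_active)
		return 1;
#else
	(void)vp;	/* toy version only: nothing else uses vp here */
#endif
	return 0;
}
```

Without the guard around the declaration and the assignment, a build
without WAPBL sets "mp" but never reads it, which is what the compiler
complained about.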

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)




Re: 7.99.50 complete hang

2016-12-19 Thread J. Hannken-Illjes

> On 19 Dec 2016, at 14:31, Joerg Sonnenberger <jo...@bec.de> wrote:
> 
> On Sun, Dec 18, 2016 at 09:55:58PM +0100, J. Hannken-Illjes wrote:
>> 
>>> On 18 Dec 2016, at 21:49, Joerg Sonnenberger <jo...@bec.de> wrote:
>>> 
>>> On Sun, Dec 18, 2016 at 09:45:00PM +0100, Joerg Sonnenberger wrote:
>>>> On Fri, Dec 16, 2016 at 01:14:10AM +0100, Thomas Klausner wrote:
>>>>> When I start my chrooted bulkbuild, the system completely stops. It
>>>>> prints a couple of dots (like when it farms out the first steps of the
>>>>> dependency chain computation) and then nothing. When I try to open a
>>>>> second shell in screen, screen locks up completely.
>>>> 
>>>> In my case, the scan finished, but it dead locked as soon as it tries to
>>>> write to binary packages. This worked fine with a kernel from the ~Dec 8
>>>> sources.
>>> 
>>> Comparing the ident output makes me suspect the vnode changes on Dec
>>> 14th. Juergen?
>> 
>> Please revert sys/kern/vfs_vnode.c to Rev 1.62 to make sure it is the
>> result of this commit.
> 
> No hang yet with 1.62.

Ok, there is a problem with vdrain_vrele().  If all tests pass I will
commit a fix tomorrow.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: 7.99.50 complete hang

2016-12-18 Thread J. Hannken-Illjes

> On 18 Dec 2016, at 21:49, Joerg Sonnenberger <jo...@bec.de> wrote:
> 
> On Sun, Dec 18, 2016 at 09:45:00PM +0100, Joerg Sonnenberger wrote:
>> On Fri, Dec 16, 2016 at 01:14:10AM +0100, Thomas Klausner wrote:
>>> When I start my chrooted bulkbuild, the system completely stops. It
>>> prints a couple of dots (like when it farms out the first steps of the
>>> dependency chain computation) and then nothing. When I try to open a
>>> second shell in screen, screen locks up completely.
>> 
>> In my case, the scan finished, but it dead locked as soon as it tries to
>> write to binary packages. This worked fine with a kernel from the ~Dec 8
>> sources.
> 
> Comparing the ident output makes me suspect the vnode changes on Dec
> 14th. Juergen?

Please revert sys/kern/vfs_vnode.c to Rev 1.62 to make sure it is the
result of this commit.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: repeated failure to properly shutdown

2016-07-22 Thread J. Hannken-Illjes

> On 22 Jul 2016, at 10:39, Robert Elz  wrote:
> 
>Date:Thu, 21 Jul 2016 16:38:57 -0700
>From:bch 
>Message-ID:  
> 

Re: repeated failure to properly shutdown

2016-07-21 Thread J. Hannken-Illjes

> On 21 Jul 2016, at 19:26, co...@sdf.org wrote:
> 
> On Thu, Jul 21, 2016 at 10:15:32AM -0700, bch wrote:
>> Jul 20 23:55:59 kamloops /netbsd: wapbl_discard() at 
>> netbsd:wapbl_discard+0x20c
>> Jul 20 23:55:59 kamloops /netbsd: vclean() at netbsd:vclean+0x2ae
> ...
>> Jul 20 23:55:59 kamloops /netbsd: tmpfs_unmount() at 
>> netbsd:tmpfs_unmount+0x2f
> 
> For some reason this condition is met:
> wapbl_vphaswapbl(vp)
> 
> but why?

The contents of this "struct vnode", especially its "v_tag"
and "v_mount" could help.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: nbctfmerge runs for hours on custom i386 kernels

2016-03-07 Thread J. Hannken-Illjes

> On 02 Mar 2016, at 00:58, John D. Baker <jdba...@mylinuxisp.com> wrote:
> 
> On Tue, 1 Mar 2016, John D. Baker wrote:
> 
>> I have not observed this behavior when building any of the standard
>> kernels, but only most if not all of my custom kernels (which simply
>> include GENERIC and exclude unnecessary items with the "no foo at bar"
>> mechanism).
> 
> I should note that I build all my custom kernels with the "kernel.gdb=FOO"
> mechanism to produce "FOO/netbsd.gdb", in case that has any bearing on
> the situation.

Should be fixed now with Revision 1.4 of
external/bsd/elftoolchain/dist/libdwarf/libdwarf_elf_init.c

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: nbctfmerge runs for hours on custom i386 kernels

2016-03-06 Thread J. Hannken-Illjes

> On 03 Mar 2016, at 19:44, Christos Zoulas <chris...@astron.com> wrote:
> 
> In article <5e50530c-0628-46e0-9457-440804e21...@eis.cs.tu-bs.de>,
> J. Hannken-Illjes <hann...@eis.cs.tu-bs.de> wrote:
>> 
>>> On 02 Mar 2016, at 09:11, Martin Husemann <mar...@duskware.de> wrote:
>>> 
>>> On Tue, Mar 01, 2016 at 05:42:51PM -0600, John D. Baker wrote:
>>>> I have so-far observed this only on i386 and not any of the other
>>>> architectures I build.
>>> 
>>> This is probably caused by ld.elf_so bugs. There is a pullup request
>>> pending to (hopefully) fix this.
>>> 
>>> A simple test (and easy workaround) is to extract usr/libexec/ld.elf_so
>>> from a -current i386 base.tgz and put that on your machine.
>> 
>> For me it is nbctfconvert that creates bad ctf sections on i386 and makes
>> nbctfmerge run for an hour on debug (-g) kernels.
>> 
>> Reverting this
>> 
>> @@ -108,5 +122,6 @@ _dwarf_elf_relocate(Dwarf_Debug dbg, Elf
>>   }
>> 
>> -   if (sh.sh_type != SHT_RELA || sh.sh_size == 0)
>> +   if ((sh.sh_type != SHT_REL && sh.sh_type != SHT_RELA) ||
>> +sh.sh_size == 0)
>>   continue;
>> 
>> from the recent change to
>> external/bsd/elftoolchain/dist/libdwarf/libdwarf_elf_init.c
>> makes my builds happy again.
>> 
>> Looks like the .debug_info section gets modified to always return
>> the string at offset 0. 
> 
> This is part of this change:
> https://svnweb.freebsd.org/base/head/contrib/elftoolchain/libdwarf/libdwarf_elf_init.c?r1=278593&r2=278611

Yes, and it looks wrong, at least for our i386 objects that use “SHT_REL”
for “.debug_info” where amd64 for example uses “SHT_RELA”.

According to the “TIS ELF Specification” page 1-23:

As shown above, only Elf32_Rela entries contain an explicit addend.
Entries of type Elf32_Rel store an implicit addend in the location
to be modified. Depending on the processor architecture, one form
or the other might be necessary or more convenient. Consequently,
an implementation for a particular machine may use one form
exclusively or either form depending on context.

but function libdwarf_elf_init.c::_dwarf_elf_apply_rel_reloc() ignores
this “implicit addend” and treats it as zero.

Take for example file vers.o from a “-g” i386 kernel build:

RELOCATION RECORDS FOR [.debug_info]:
OFFSET   TYPE  VALUE 
0006 R_386_32  .debug_abbrev
000c R_386_32  .debug_str
0011 R_386_32  .debug_str

where .debug_info looks like:

Contents of section .debug_info:
 b201 0400 00000401 1501
0010 01270200 007f  00020106

Here location 0x000c is not zero but this value gets overwritten
with zero.  This leads to all strings to be relocated to zero.

Running “nbctfconvert” with “env CTFCONVERT_DEBUG_LEVEL=9” will
show it in more detail.
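
A minimal sketch of the difference, assuming a little-endian 32-bit target
and an R_386_32 style relocation (result = S + A).  The helper names are
made up for illustration and this is not the libdwarf code:

```c
#include <stdint.h>

/* Read a little-endian 32-bit word from a section buffer. */
uint32_t
read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write a little-endian 32-bit word into a section buffer. */
void
write_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}

/*
 * SHT_REL: the addend is implicit, it is the word already stored at the
 * location to be modified.  Read it first, then store S + A.
 */
void
apply_rel32(uint8_t *sec, uint32_t offset, uint32_t symval)
{
	uint32_t addend = read_le32(sec + offset);

	write_le32(sec + offset, symval + addend);
}

/*
 * The behaviour described above: the implicit addend is treated as zero,
 * so whatever was stored at the location is lost.
 */
void
apply_rel32_ignore_addend(uint8_t *sec, uint32_t offset, uint32_t symval)
{
	write_le32(sec + offset, symval);
}
```

With a section symbol value of zero, as for “.debug_str” in a relocatable
object, the first variant leaves the stored string offset intact while the
second one turns every string reference into offset 0.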

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)




Re: nbctfmerge runs for hours on custom i386 kernels

2016-03-03 Thread J. Hannken-Illjes

> On 02 Mar 2016, at 09:11, Martin Husemann <mar...@duskware.de> wrote:
> 
> On Tue, Mar 01, 2016 at 05:42:51PM -0600, John D. Baker wrote:
>> I have so-far observed this only on i386 and not any of the other
>> architectures I build.
> 
> This is probably caused by ld.elf_so bugs. There is a pullup request
> pending to (hopefully) fix this.
> 
> A simple test (and easy workaround) is to extract usr/libexec/ld.elf_so
> from a -current i386 base.tgz and put that on your machine.

For me it is nbctfconvert that creates bad ctf sections on i386 and makes
nbctfmerge run for an hour on debug (-g) kernels.

Reverting this

@@ -108,5 +122,6 @@ _dwarf_elf_relocate(Dwarf_Debug dbg, Elf
}
 
-   if (sh.sh_type != SHT_RELA || sh.sh_size == 0)
+   if ((sh.sh_type != SHT_REL && sh.sh_type != SHT_RELA) ||
+sh.sh_size == 0)
continue;

from the recent change to 
external/bsd/elftoolchain/dist/libdwarf/libdwarf_elf_init.c
makes my builds happy again.

Looks like the .debug_info section gets modified to always return
the string at offset 0. 

Running nbctfconvert with "CTFCONVERT_DEBUG_LEVEL=9" I get from an amd64
object:

DEBUG: NO stabs: .stab=-1, .stabstr=0
DEBUG: DWARF version: 4
DEBUG: DWARF emitter: GNU C 4.8.5 -mcmodel=kernel -mno-red-zone -mno-mmx 
-mno-sse -mno-avx -msoft-float -mtune=nocona -march=x86-64 -g -O2 -std=gnu99 
-std=gnu99 -ffreestanding -fno-zero-initialized-in-bss -fno-omit-frame-pointer 
-fstack-protector -fno-strict-aliasing -fno-common --param ssp-buffer-size=1
DEBUG: CU name: vers.c
DEBUG: die 29 <0x1d>: create_one
DEBUG: die 29: creating base type
DEBUG: die 29: name "signed char" remapped to "char"

and from an i386 object:

DEBUG: NO stabs: .stab=-1, .stabstr=0
DEBUG: DWARF version: 4
DEBUG: DWARF emitter: long long int
DEBUG: CU name: long long int
DEBUG: die 29 <0x1d>: create_one
DEBUG: die 29: creating base type
DEBUG: die 29: name "long long int" remapped to "long long"


--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)




Re: NFS related panics and hangs

2015-11-05 Thread J. Hannken-Illjes
On 05 Nov 2015, at 21:48, Rhialto <rhia...@falu.nl> wrote:


> Looking into this:
> 
> the occurrences of nfs_reqq are as follows:
> 
> fs/nfs/client/nfs_clvnops.c: * nfs_reqq_mtx : Global lock, protects the 
> nfs_reqq list.
> 
> Since there is no other mention of nfs_reqq_mtx in the whole syssrc
> tarball, this looks wrong.  It also immediately causes the suspicion
> that the list isn't in fact protected at all.

This file (fs/nfs/client/nfs_clvnops.c) is part of a second (dead) nfs
implementation from FreeBSD.  It is not part of any kernel.

Our nfs lives in sys/nfs.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)





Re: Killing a zombie process?

2015-10-16 Thread J. Hannken-Illjes
On 15 Oct 2015, at 00:21, Rhialto <rhia...@falu.nl> wrote:

> On Wed 14 Oct 2015 at 09:39:40 +0200, J. Hannken-Illjes wrote:
>> Looks like a deadlock, two threads in tstile.
>> 
>> Please take a backtrace (with arguments) of these threads.
> 
> I've got a whole lot more in tstile, and that is even just from running
> pkg_comp in the chroot. I didn't try to interrupt anything yet.
> 
> load averages:  0.00,  0.20,  0.44;   up 0+02:23:43
> 22:43:52
> 78 processes: 76 sleeping, 2 on CPU
> CPU states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
> Memory: 393M Act, 60K Inact, 31M Wired, 31M Exec, 273M File, 3239M Free
> Swap: 4096M Total, 4096M Free
> 
> 
> vargaz:~$ ps alxtp1
> UID   PID  PPID   CPU PRI NI   VSZ   RSS WCHAN   STAT TTY  TIME COMMAND
> 1000  139174 0  85  0 13208  2528 waitIs   ttyp1 0:00.02 -bash
>   0  1759  1391  1107  85  0 13304  1576 waitIttyp1 0:00.13 /bin/sh 
> /usr/pkg/sbin/pkg_comp chroot
>   0   865  1759  1107  85  0 13304  1140 waitIttyp1 0:00.01 /bin/sh 
> /pkg_comp/tmp/pkg_comp-sOjsoA.sh
>   0   874   865 13547  82  0 11088  1412 pause   Ittyp1 0:00.01 /bin/ksh
>   0   267   874 20048  81  0 15360  1720 waitI+   ttyp1 0:00.22 /bin/sh 
> -e /usr/pkg/sbin/pkg_chk
>   0  9782   267 20048  81  0 15360  1448 waitI+   ttyp1 0:00.00 sh -c cd 
> /usr/pkgsrc/devel/mercurial && /usr/bin/make u
>   0  8085  9782 0 117  0 15224  3452 tstile  D+   ttyp1 0:00.14 
> /usr/bin/make update CLEANDEPENDS
>   0 26889  8085 29745  78  0 15360  1424 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e; /usr/bin/env MAKECONF=/etc/mk.conf P
>   0 14050 26889 0 117  0 15224  3444 tstile  D+   ttyp1 0:00.14 
> /usr/bin/make _MAKE OPSYS OS_VERSION LOWER_OPSYS _PKGSR
>   0  6325 14050 22699  80  0 15360  1428 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e; pkgpattern=mercurial-3.5.1;\t\t\t\t
>   0 13334  6325 0 117  0 15224  3452 tstile  D+   ttyp1 0:00.14 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS HOST_OSTYPE
>   0  2892 13334 29745  78  0 15364  1444 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e;\t\t\t\t\t\t\t\t exec 3<&0;\t\t\t\t\t
>   0 13425  2892 29745  78  0 15364  1136 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e;\t\t\t\t\t\t\t\t exec 3<&0;\t\t\t\t\t
>   0 17339 13425 0 117  0 15224  3504 tstile  D+   ttyp1 0:00.16 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG
>   0 11893 17339 23601  80  0 15364  1432 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e; pkgpattern=py27-mercurial\\>=3.5.1;\
>   0 21797 11893 0 117  0 15228  3512 tstile  D+   ttyp1 0:00.18 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG
>   0  1347 21797 23778  80  0 15364  1456 waitI+   ttyp1 0:00.00 /bin/sh 
> -c set -e;\t\t\t\t\t if test -n "" &&  /usr/pkg
>   0 23567  1347 0 117  0 15228  4032 tstile  D+   ttyp1 0:00.38 
> /usr/bin/make .MAKE.LEVEL.ENV CLEANDEPENDS DEPENDS_TARG
>   0  3383 23567 29360  78  0 15364  1432 waitI+   ttyp1 0:00.00 /bin/sh 
> -c (cd /pkg_comp/obj/pkgsrc/devel/py-mercurial/
>   0 21311  3383 28277  79  0 81652 11580 waitI+   ttyp1 0:00.14 
> /usr/pkg/bin/python2.7 setup.py build
>   0 24114 21311 28277  79  0 15364  1424 waitI+   ttyp1 0:00.01 /bin/sh 
> /pkg_comp/obj/pkgsrc/devel/py-mercurial/default
>   0  3590 24114 28277  79  0 15364  1472 waitI+   ttyp1 0:00.00 /bin/sh 
> /usr/pkgsrc/mk/tools/msgfmt.sh
>   0  7060  3590 28277 117  0  4244   188 tstile  D+   ttyp1 0:00.00 /bin/cat
>   0 18497  3590 28277  79  0 10880  1064 pipe_wr I+   ttyp1 0:00.00 /bin/cat 
> i18n/el.po
>   0 23883  3590 0 117  0  6580   236 netio   D+   ttyp1 0:00.00 
> /usr/bin/msgfmt -v -o mercurial/locale/el/LC_MESSAGES/h
>   0 27257  3590 28277 117  0  4244   188 tstile  D+   ttyp1 0:00.00 /bin/cat
>   0 29472  3590 28277  79  0 14244  2344 pipe_wr I+   ttyp1 0:00.01 
> /usr/bin/awk -f /usr/bin/awk
> 
> (I've re-arranged the order to get parents before children)
> 
> Here are backtraces of the processes in tstile (and the shell that
> spawned the 4 leaf children). I have kept the dump so I can examine it
> further.
> 
> Unfortunately, crash(8) didn't give me arguments, nor did ddb when I
> tried that (I used the GENERIC kernel, what options do I need to get the
> arguments?)
> 
> Script started on Wed Oct 14 23:41:43 2015
> vargaz:~/crash$ crash -M netbsd.3.core -N netbsd.test
> Crash version 7.0, image version 7.99.21.
> WARNING: versions differ, you may not be able to examine this image.
> System panicked: dump forced via kernel debugger
> Backtrace from time of crash is available.
> 
> 
> crash> bt/t 0t3590
> trace: pid 3590 lid 1 at 0xff

Re: Killing a zombie process?

2015-10-16 Thread J. Hannken-Illjes
On 16 Oct 2015, at 13:44, Rhialto <rhia...@falu.nl> wrote:

> On Thu 15 Oct 2015 at 20:12:44 +0200, Rhialto wrote:
>> On Thu 15 Oct 2015 at 06:57:42 +0700, Robert Elz wrote:
>>> Do you really need that mounted twice like that, and if not, can you try
>>> with one of them missing and see if the problem remains ?
>> 
>> Good idea, I'll try that later!
> 
> "Interesting" results: it built packages overnight (from around 22:30 to
> 12:13, so for nearly 14 hours), then, when I didn't look, it rebooted.

With panic?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)





Re: Killing a zombie process?

2015-10-14 Thread J. Hannken-Illjes

On 14 Oct 2015, at 00:20, Rhialto <rhia...@falu.nl> wrote:

> I may have something similar; with 7.0/amd64 GENERIC kernel.
> 
> I've been doing builds in pkg_comp with the chroot directory and /usr/pkgsrc
> mounted over nfs. After some packages, some processes simply don't terminate.
> 
> Some of my processes are now (after trying to exit pkg_comp which hangs)
> 
> UID   PID  PPID  CPU PRI NIVSZ   RSS WCHAN   STAT TTY   TIME COMMAND
>   0   402 10  85  0  15360  1428 waitIpts/2  0:00.00 /bin/sh 
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l 
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
> 1000   683  29070  85  0  13224  2588 waitIs   pts/2  0:00.03 -bash
>   0  2847   683  257 117  0  13304  1576 tstile  D+   pts/2  0:00.02 /bin/sh 
> /usr/pkg/sbin/pkg_comp chroot
>   0 14284 10  85  0  15360  1428 waitIpts/2  0:00.00 /bin/sh 
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l 
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
>   0 26291   402  708 117  0  15360  1004 tstile  Dpts/2  0:00.00 /bin/sh 
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l 
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
>   0 28266 142840 116  0  15360  1004 netio   Dpts/2  0:00.01 /bin/sh 
> -c set -e; /usr/bin/find /pkg_comp/packages/*/lame-3.99.5nb3.tgz -type l 
> -print\t 2>/dev/null | /usr/bin/xargs /bin/rm -f
> 
> No zombies involved, though.

Looks like a deadlock, two threads in tstile.

Please take a backtrace (with arguments) of these threads.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)





Re: coretemp0: workqueue busy: updates stopped

2015-06-24 Thread J. Hannken-Illjes
On 24 Jun 2015, at 10:51, Paul Goyette <p...@vps1.whooppee.com> wrote:

<snip>
> 
> There is a rather interesting mutex-dance in sme_check_events() about
> which I need to think:
> 
>   mutex_enter(wq_mutex)
>   check for empty wq
>   mutex_exit(wq_mutex)
> 
>   mutex_enter(global_sysmon_mutex)
>   mutex_enter(wq_mutex)
>   queue up the wq entries
>   mutex_exit(wq_mutex)
>   check for low_power condition
>   mutex_exit(global_sysmon_mutex)
> 
> I'm pretty sure this can be reduced a bit:
> 
>   mutex_enter(global_sysmon_mutex)
>   mutex_enter(wq_mutex)
>   check for empty wq
> 
>   queue up the wq entries
>   mutex_exit(wq_mutex)
>   check for low_power condition
>   mutex_exit(global_sysmon_mutex)

It can't, see rev. 1.114:

Add a counter of busy events and stop enqueueing more work if a device is busy.
Protect this counter with a new short time lock sme_work_mtx and
keep sme_mtx as long time lock.

Removes a deadlock where an active event holds sme_mtx, the callout
sme_events_check blocks on sme_mtx and callout processing stops.
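
The two-lock arrangement from that commit message can be sketched in user space roughly as follows. This is only a model under assumptions: struct sme_model, timer_try_enqueue() and worker_run() are hypothetical names, pthread mutexes stand in for the kernel mutexes, and the real sysmon code differs in detail.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not the sysmon_envsys code itself. */
struct sme_model {
	pthread_mutex_t long_mtx;  /* plays the role of sme_mtx: held during a slow refresh */
	pthread_mutex_t work_mtx;  /* plays the role of sme_work_mtx: held only briefly */
	unsigned busy;             /* events queued or running, guarded by work_mtx */
};

static void
sme_model_init(struct sme_model *m)
{
	pthread_mutex_init(&m->long_mtx, NULL);
	pthread_mutex_init(&m->work_mtx, NULL);
	m->busy = 0;
}

/*
 * Timer side: takes only the short lock, so it can never block
 * behind a refresh that holds the long lock.
 */
static bool
timer_try_enqueue(struct sme_model *m)
{
	bool queued = false;

	pthread_mutex_lock(&m->work_mtx);
	if (m->busy == 0) {        /* device idle: account for one new work item */
		m->busy++;
		queued = true;
	}
	pthread_mutex_unlock(&m->work_mtx);
	return queued;
}

/* Worker side: does the slow part under the long lock, then drops the count. */
static void
worker_run(struct sme_model *m, void (*refresh)(void *), void *arg)
{
	pthread_mutex_lock(&m->long_mtx);
	if (refresh != NULL)
		refresh(arg);
	pthread_mutex_unlock(&m->long_mtx);

	pthread_mutex_lock(&m->work_mtx);
	m->busy--;
	pthread_mutex_unlock(&m->work_mtx);
}
```

The point of the split is that the timer path never needs the long lock, so merging the two critical sections would bring the blocking back.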

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: coretemp0: workqueue busy: updates stopped

2015-06-23 Thread J. Hannken-Illjes
On 23 Jun 2015, at 12:01, John D. Baker jdba...@mylinuxisp.com wrote:

 Last night upon updating my file server from 7.0_BETA to 7.0_RC1 (amd64),
 the message in the Subject: line appeared during the shutdown/reboot
 sequence and the machine was stuck there.

Does it happen on every reboot or did you see it once?

Backtrace (bt) and status (ps /l) from ddb would help.

 I dropped to DDB and issued
 the reboot command.  While the filesystem on the raidframe RAID (RAID-R)
 was unmounted, the RAID itself had not yet been detached/un-configured.
 The forcible shutdown required a parity rebuild upon reboot.
 
 Has anyone experienced a similar hang on shutdown/reboot?  I didn't
 bother recording the backtrace in DDB as I just wanted my file server
 back up and running...
 
 -- 
 |/\ John D. Baker, KN5UKS   NetBSD Darwin/MacOS X
 |\ / jdbaker[snail]mylinuxisp[flyspeck]comOpenBSDFreeBSD
 | X  No HTML/proprietary data in email.   BSD just sits there and works!
 |/ \ GPGkeyID:  D703 4A7E 479F 63F8 D3F4  BD99 9572 8F23 E4AD 1645
 

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-03-13 Thread J. Hannken-Illjes
On 12 Mar 2015, at 20:59, Christos Zoulas chris...@zoulas.com wrote:

 |  | Now we have a deadlock, softclk/0 waits for the mutex and therefore
 |  | callouts will no longer be processed and ciss holds the mutex and waits
 |  | for a callout through cv_timedwait.
 |
 |  Thanks for looking into it! Part of the ciss_ioctl_vol() (the pdid part)
 |  does things with XS_CTL_POLL so that it does not involve any mutexes. It
 |  would be simple to change the ldid part to do the same. Should we do that?

The mutex involved is the sme_mtx protecting the struct sysmon_envsys, so
our problem doesn't come from missing POLL.

 |  | - Sleeping up to 60 seconds in a function used by a callout is wrong.
 |  
 |  Yes, but many disk drivers seem to violate that. How do we fix this?
 |  Making a separate thread that updates statistics for each driver seems
 |  suboptimal?

We already have it.  If I understand sysmon right, it is already based on
workqueues (the ciss0 thread here):

The workqueue updates sensors every sme->sme_events_timeout seconds, default
is 30 seconds.  Workqueue items get enqueued from a callout.

Both running a workqueue item and processing the callout locks the
same mutex sme->sme_mtx.

For this to work the workqueue must complete before the callout fires:

sme->sme_nsensors * ccb->ccb_xs->timeout < sme->sme_events_timeout

In our ciss case we could set:

sc->sc_sme->sme_events_timeout = 30

ccb->ccb_xs->timeout = 20 / sc->maxunits

to become safe.
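
The budget argument above can be checked with a few lines of C. This is a sketch only: refresh_fits() and max_cmd_timeout() are hypothetical helpers, and both timeouts are assumed to be expressed in the same unit.

```c
#include <assert.h>

/*
 * Nonzero when every sensor can run into its command timeout back to
 * back and the refresh still ends before the next callout fires.
 */
static int
refresh_fits(unsigned nsensors, unsigned cmd_timeout, unsigned events_timeout)
{
	return (unsigned long long)nsensors * cmd_timeout < events_timeout;
}

/*
 * Largest per-command timeout that keeps the worst case strictly
 * below the events timeout; 0 if there is no such value.
 */
static unsigned
max_cmd_timeout(unsigned nsensors, unsigned events_timeout)
{
	if (nsensors == 0 || events_timeout == 0)
		return 0;
	return (events_timeout - 1) / nsensors;
}
```

With 4 units and an events timeout of 30 this allows at most 7 per command; the 20 / maxunits proposal simply leaves more headroom.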

Hope I got this right so far ...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Re: DoS attack against TCP services

2015-03-13 Thread J. Hannken-Illjes

On 13 Mar 2015, at 13:03, Christos Zoulas chris...@zoulas.com wrote:

 On Mar 13,  1:00pm, hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) wrote:
 -- Subject: Re: DoS attack against TCP services
 
 | This would be simple, changing dev/ic/ciss.c like:
 | 
 | sc->sc_sme->sme_name = device_xname(sc->sc_dev);
 | sc->sc_sme->sme_cookie = sc;
 | sc->sc_sme->sme_refresh = ciss_sensor_refresh;
 | +   sc->sc_sme->sme_events_timeout = 60;
 | 
 | should do the job.  Unfortunately I have no hardware to test.
 
 Yes, but is 60 enough? Leaving the calculation to each driver
 is potentially dangerous. Could we make it self adjusting?

This was just an idea ... Maybe

...xs..timeout * sc->maxunits + 10

and set xs timeout to 1 .. 5 seconds?

I don't think it is possible to make it self adjusting as the
sysmon framework doesn't know the drivers timeouts.

 |  Nevertheless, I think that the big problem with ciss is now
 |  fixed (i.e. it will not hang forever anymore)...
 | 
 | It may still wait longer than 30 seconds with the sme_mutex held
 | leading to deadlock.
 | 
 | We should use a suitable xs timeout vs. events timeout to make it safe,
 | either increase the event timeout or decrease the xs timeout.
 
 It would be nice if it was safe by default, and it should spam the
 kernel if it was late so that we know about it...

Unfortunately it may deadlock BEFORE it finds a non-empty workqueue.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-03-13 Thread J. Hannken-Illjes
On 13 Mar 2015, at 12:53, Christos Zoulas chris...@zoulas.com wrote:

 On Mar 13, 11:19am, hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) wrote:
 -- Subject: Re: DoS attack against TCP services
 
 | The mutex involved is the sme_mtx protecting the struct sysmon_envsys, so
 | our problem doesn't come from missing POLL.
 
 That's what I thought.
 
 | We already have it.  If I understand sysmon right, it is already based on
 | workqueues (the ciss0 thread here):
 | 
 | The workqueue updates sensors every sme->sme_events_timeout seconds, default
 | is 30 seconds.  Workqueue items get enqueued from a callout.
 | 
 | Both running a workqueue item and processing the callout locks the
 | same mutex sme->sme_mtx.
 | 
 | For this to work the workqueue must complete before the callout fires:
 | 
 | sme->sme_nsensors * ccb->ccb_xs->timeout < sme->sme_events_timeout
 | 
 | In our ciss case we could set:
 | 
 | sc->sc_sme->sme_events_timeout = 30
 | 
 | ccb->ccb_xs->timeout = 20 / sc->maxunits
 | 
 | to become safe.
 | 
 | Hope I got this right so far ...
 
 Yes, you do. We could decrease the timeout for probing, but that might
 lead to unsuccessful sensor reads. Even then perhaps there is a place
 to have a special mode for sysmon to use a separate thread for reading
 the sensors of a particular driver, or a way to change the sysmon period
 to be longer.

This would be simple, changing dev/ic/ciss.c like:

sc->sc_sme->sme_name = device_xname(sc->sc_dev);
sc->sc_sme->sme_cookie = sc;
sc->sc_sme->sme_refresh = ciss_sensor_refresh;
+   sc->sc_sme->sme_events_timeout = 60;

should do the job.  Unfortunately I have no hardware to test.

 Nevertheless, I think that the big problem with ciss is now
 fixed (i.e. it will not hang forever anymore)...

It may still wait longer than 30 seconds with the sme_mutex held
leading to deadlock.

We should use a suitable xs timeout vs. events timeout to make it safe,
either increase the event timeout or decrease the xs timeout.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-03-12 Thread J. Hannken-Illjes
On 28 Feb 2015, at 21:05, Christos Zoulas chris...@zoulas.com wrote:

 On Feb 28,  8:26pm, 6b...@6bone.informatik.uni-leipzig.de 
 (6b...@6bone.informatik.uni-leipzig.de) wrote:
 -- Subject: Re: DoS attack against TCP services
 
 | On Sat, 28 Feb 2015, Christos Zoulas wrote:
 | 
 |  Yes, that's a good start but we need to find which process that
 |  lwp belongs to.
 | 
 | I'm not sure what the best course of action is. The machine is still 
 | running. Should you try to get the information from the current system or 
 | force a dump and analyze this?
 | 
 | On Sat, 28 Feb 2015, J. Hannken-Illjes wrote:
 | 
 |  Looks unlocked -- what about a backtrace of thread 0.5,
 |  bt /a 0xfe882df11860
 | 
 | https://www.ipv6.uni-leipzig.de/bt_0xfe882df11860.png
 
 So who else is holding the sysmon sme_mtx?

Analyzed a crash dump and found two threads deadlocked.

0   77 3   0   200   fe813b495b60  ciss0     ciss_cmd
0    5 3   0   200   fe882df11860  softclk/0 tstile

Backtrace of softclk/0:

...
3 mutex_vector_enter sys/kern/kern_mutex.c:682
4 sme_events_check   sys/dev/sysmon/sysmon_envsys_events.c:734
5 callout_softclock  sys/kern/kern_timeout.c:743
6 softint_execute    sys/kern/kern_softint.c:589
...

Here the event struct sme is:

sme_name = ciss0
sme_mtx.u.mtxa_owner = 0xfe813b495b62 (Thread ciss0)

Backtrace of ciss0:

...
 2  cv_timedwait                 sys/kern/kern_condvar.c:261
 3  ciss_cmd                     sys/dev/ic/ciss.c:542
 4  ciss_ldid                    sys/dev/ic/ciss.c:883
 5  ciss_ioctl_vol               sys/dev/ic/ciss.c:1388
 6  ciss_sensor_refresh          sys/dev/ic/ciss.c:1544
 7  sysmon_envsys_refresh_sensor sys/dev/sysmon/sysmon_envsys.c:2027
 8  sme_events_worker            sys/dev/sysmon/sysmon_envsys_events.c:769
 9  workqueue_runlist            sys/kern/subr_workqueue.c:104
10  workqueue_worker             sys/kern/subr_workqueue.c:135
...

The sme mutex was locked from sme_events_worker at sysmon_envsys_events.c:760.

Now we have a deadlock, softclk/0 waits for the mutex and therefore
callouts will no longer be processed and ciss holds the mutex and waits
for a callout through cv_timedwait.

Taking a closer look at the poll loop from sys/dev/ic/ciss.c:537 ... this
code looks wrong in many aspects:

- Sleeping up to 60 seconds in a function used by a callout is wrong.

- Examining variables here we get: tick = 10000, etick = 16000,
  tohz = 6000 and i = 599.  As tick is constant (us per hz)
  this loop might run for 599*60 seconds!

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-03-12 Thread J. Hannken-Illjes
On 12 Mar 2015, at 20:00, Christos Zoulas chris...@zoulas.com wrote:

 On Mar 12, 12:20pm, hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) wrote:
 -- Subject: Re: DoS attack against TCP services
 
 | Now we have a deadlock, softclk/0 waits for the mutex and therefore
 | callouts will no longer be processed and ciss holds the mutex and waits
 | for a callout through cv_timedwait.
 
 Thanks for looking into it! Part of the ciss_ioctl_vol() (the pdid part)
 does things with XS_CTL_POLL so that it does not involve any mutexes. It
 would be simple to change the ldid part to do the same. Should we do that?
 
 | Taking a closer look at the poll loop from sys/dev/ic/ciss.c:537 ... this
 | code looks wrong in many aspects:
 | 
 | - Sleeping up to 60 seconds in a function used by a callout is wrong.
 
 Yes, but many disk drivers seem to violate that. How do we fix this?
 Making a separate thread that updates statistics for each driver seems
 suboptimal?
 
 | - Examining variables here we get: tick = 10000, etick = 16000,
 |   tohz = 6000 and i = 599.  As tick is constant (us per hz)
 |   this loop might run for 599*60 seconds!
 
 I committed a fix for this. Now it should only sleep up to 60 seconds.

Looks like you made it worse.

tick is constant, for HZ == 100 it is 10000 so you now have

etick = tick + tohz -> etick = 10000 + tohz

and then

tohz = etick - tick -> tohz = (10000 + tohz) - 10000 -> tohz = tohz

so ciss_wait() may now loop forever.  Are you looking for hardclock_ticks?
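
A small sketch of that arithmetic, with remaining_after() as a hypothetical helper: a deadline is computed from one reading of a time source and the remaining time from a later reading. It only shrinks if the time source actually advances, which a constant such as tick never does.

```c
#include <assert.h>

/*
 * One pass of the recomputation: deadline from the first reading of
 * the time source, remaining time from the second reading.
 */
static int
remaining_after(int now_before, int now_after, int tohz)
{
	int etick = now_before + tohz;   /* absolute deadline */

	return etick - now_after;        /* time left until the deadline */
}
```

Feeding the same value for both readings returns tohz unchanged, so the loop condition never makes progress; feeding an advancing counter (hardclock_ticks being the candidate named above) makes the remainder drop on every pass.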

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: rump i386 cross compile trouble

2015-03-07 Thread J. Hannken-Illjes
On 07 Mar 2015, at 18:03, Patrick Welche pr...@cam.ac.uk wrote:

 On Tue, Mar 03, 2015 at 02:56:16PM +0000, Patrick Welche wrote:
 On Tue, Mar 03, 2015 at 12:10:15PM +0000, Patrick Welche wrote:
 Not having much luck.. with today's source I see:
 No DBG / optimisation anywhere. Additions to /etc/mk.conf:
 
 RUMP_DIAGNOSTIC=yes
 RUMP_DEBUG=yes
 RUMP_LOCKDEBUG=yes
 RUMP_KTRACE=yes
 
 Removing these gets a successful build - that narrows it down a bit...
 
 Bisection just yielded a surprise: the build with RUMP_DEBUG=yes was
 broken by sys/kern/kern_module.c
 
 revision 1.103
 date: 2015-02-28 23:04:34 +0000;  author: jmcneill;  state: Exp;  lines: +4 -3;
 commitid: X5g1KIdu4fu6uPby;
 if the root file-system is not yet mounted, hide vfs load failed spam with
 options DEBUG
 
 
 -   if (modclass != MODULE_CLASS_EXEC || error != ENOENT)
 +   if ((modclass != MODULE_CLASS_EXEC || error != ENOENT) &&
 +   root_device != NULL)
 
 but why? Compiler bug (gcc)?

Given this fragment is #ifdef DEBUG it looks like rump_server has to
be linked with librumpvfs in the DEBUG case?

Antti?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-02-28 Thread J. Hannken-Illjes

On 28 Feb 2015, at 16:28, Christos Zoulas chris...@zoulas.com wrote:

 On Feb 28, 11:37am, 6b...@6bone.informatik.uni-leipzig.de 
 (6b...@6bone.informatik.uni-leipzig.de) wrote:
 -- Subject: Re: DoS attack against TCP services
 
 | On Fri, 13 Feb 2015, Christos Zoulas wrote:
 | 
 |  I tried adding show callout to crash(8) but it is not useful because the
 |  pointers move too quickly. OTOH, next time this happens you can enter ddb
 |  on your machine and type show callout and see if that sheds any light
 |  to the expired and not fired callouts...
 | 
 |  christos
 | 
 | 
 | The problem occurred again. I have created a couple of screenshots. 
 | Unfortunately I can not interpret the output.
 | 
 | https://www.ipv6.uni-leipzig.de/callout_1.png
 | https://www.ipv6.uni-leipzig.de/callout_2.png
 | https://www.ipv6.uni-leipzig.de/callout_3.png
 | https://www.ipv6.uni-leipzig.de/callout_4.png
 | https://www.ipv6.uni-leipzig.de/callout_x.png
 | 
 | 
 | Thank you for your efforts
 
 So all the timeouts have expired and are not firing anymore (negative times).
 This would indicate something broken with interrupts... Let me see where we
 can add some debugging...

Anyone holding proc_lock?  I had a similar problem with fstrans where
it was a deadlock with proc_lock preventing timer_intr() to succeed and
therefore all timers stopped working.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-02-28 Thread J. Hannken-Illjes
On 28 Feb 2015, at 18:20, 6b...@6bone.informatik.uni-leipzig.de wrote:

 On Sat, 28 Feb 2015, Christos Zoulas wrote:
 
 Good idea. You can use crash, ps and see what each process is holding...
 
 christos
 
 Here the output from crash and ps
 
 gate# crash
 Crash version 7.0_BETA, image version 7.99.5.
 WARNING: versions differ, you may not be able to examine this image.
 Output from a running system is unreliable.
 crash> ps
 PID    LID S CPU FLAGS   STRUCT LWP *   NAME WAIT
snip
 0    5 3   0   200   fe882df11860  softclk/0 tstile

This one looks bad.  Which thread holds proc_lock?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: DoS attack against TCP services

2015-02-28 Thread J. Hannken-Illjes
On 28 Feb 2015, at 19:39, 6b...@6bone.informatik.uni-leipzig.de wrote:

 On Sat, 28 Feb 2015, J. Hannken-Illjes wrote:
 
 This one looks bad.  Which thread holds proc_lock?
 
 
 Does this help?
 
 https://www.ipv6.uni-leipzig.de/proc_lock.png

Looks unlocked -- what about a backtrace of thread 0.5,
bt /a 0xfe882df11860

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: FUSE crashes on i386 and amd64

2014-10-31 Thread J. Hannken-Illjes
On 31 Oct 2014, at 17:09, Tom Ivar Helbekkmo t...@hamartun.priv.no wrote:

 I'm experimenting with MooseFS on NetBSD/i386 and /amd64, current as of
 September 9th.

Please update -- hopefully fixed with Rev. 1.34 of fs/puffs/puffs_node.c

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-25 Thread J. Hannken-Illjes
On 25 Aug 2014, at 17:39, Taylor R Campbell riastr...@netbsd.org wrote:

   Date: Mon, 25 Aug 2014 15:55:53 +0200
   From: J. Hannken-Illjes hann...@eis.cs.tu-bs.de
 
   GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.
 
   See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.
 
   Could this be the cause of this big loss in performance?
 
 Shouldn't be too hard to test this.  Perhaps try dropping in the
 following replacements for the vcache key comparison and running the
 test for each one?
 memequal.c

We are talking about a kernel from 2012/09 -- vcache came much later.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-25 Thread J. Hannken-Illjes
On 25 Aug 2014, at 15:55, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

 On 24 Aug 2014, at 18:57, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 
 snip
 
 I tried to bisect and got an increase in time from ~15 secs to ~24 secs
 between the time stamps '2012-09-18 06:00 UTC' '2012-09-18 09:00 UTC'.
 
 Someone should redo this test as this interval is the import of the
 compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools.  I can't
 believe this to be a compiler problem.
 
 GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.
 
 See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.
 
 Could this be the cause of this big loss in performance?

Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
from Rev. 1.3 to 1.2 brings back the old times which are the same as
they were on NetBSD 6.

Given that this test has many calls to ufs_lookup/cache_lookup using
memcmp to check for equal filenames this is not a surprise.

A rather naive implementation of memcmp (see below) drops the running
time from ~15 sec to ~9 secs.  We should consider improving our memcmp.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: libkern.h
===================================================================
RCS file: /cvsroot/src/sys/lib/libkern/libkern.h,v
retrieving revision 1.106
diff -p -u -2 -r1.106 libkern.h
--- libkern.h   30 Aug 2012 12:16:49 -0000  1.106
+++ libkern.h   25 Aug 2014 17:23:35 -0000
@@ -262,5 +262,18 @@ void   *memset(void *, int, size_t);
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
 #define memcpy(d, s, l)    __builtin_memcpy(d, s, l)
-#define memcmp(a, b, l)    __builtin_memcmp(a, b, l)
+static inline int __memcmp(const void *a, const void *b, size_t l)
+{
+   const unsigned char *pa = a, *pb = b;
+
+   if (l > 8)
+       return memcmp(a, b, l);
+   while (l-- > 0) {
+       if (__predict_false(*pa != *pb))
+           return *pa < *pb ? -1 : 1;
+       pa++; pb++;
+   }
+   return 0;
+}
+#define memcmp(a, b, l)    __memcmp(a, b, l)
 #endif
 #if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
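
For trying the idea outside the kernel, here is a user-space sketch of the same inline comparison. small_memcmp() is a hypothetical name, __predict_false is dropped, and the comparison operators are assumed to be `>` and `<` as the logic of the patch suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare short buffers byte by byte inline and hand anything longer
 * than 8 bytes to the library memcmp().
 */
static inline int
small_memcmp(const void *a, const void *b, size_t l)
{
	const unsigned char *pa = a, *pb = b;

	if (l > 8)
		return memcmp(a, b, l);
	while (l-- > 0) {
		if (*pa != *pb)
			return *pa < *pb ? -1 : 1;   /* bytes compare as unsigned */
		pa++;
		pb++;
	}
	return 0;
}
```

Only the sign of the result is meaningful, as with memcmp() itself; the inline path returns -1 or 1 while the library path may return any negative or positive value.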



Re: Testing 7.0 Beta: FFS still very slow when creating files

2014-08-24 Thread J. Hannken-Illjes
On 22 Aug 2014, at 18:29, Taylor R Campbell riastr...@netbsd.org wrote:

   Date: Fri, 22 Aug 2014 17:59:37 +0200
   From: Stephan stephan...@googlemail.com
 
   Has anybody an idea on this or how to track this down? At the moment,
   I can't even enter ddb using Strg+Alt+Esc keys for some reason. I've
   also seen people playing with dtrace but that doesn't seem to be
   included.
 
 Dtrace may be a good idea.  You can use it by
 
 (a) using a kernel built with `options KDTRACE_HOOKS',
 (b) using a userland built with MKDTRACE=yes,
 (c) modload /stand/ARCH/VERSION/solaris.kmod
modload /stand/ARCH/VERSION/dtrace.kmod
modload /stand/ARCH/VERSION/fbt.kmod
modload /stand/ARCH/VERSION/sdt.kmod
 (d) mkdir /dev/dtrace && mknod /dev/dtrace/dtrace c dtrace
 
 (Yes, this is too much work.  Someone^TM should turn it all on by
 default for netbsd-7...!)
 
 From the lockstat output it looks like there's a lot of use of
 mntvnode_lock, which suggests this may be related to hannken@'s vnode
 cache changes.  Might be worthwhile to sample stack traces of
 vfs_insmntque, with something like
 
 dtrace -n 'fbt::vfs_insmntque:entry { @[stack()] = count(); }'
 
 or perhaps sample stack traces of the mutex_enters of mntvnode_lock.

This was my first guess too ...

I tried to bisect and got an increase in time from ~15 secs to ~24 secs
between the time stamps '2012-09-18 06:00 UTC' '2012-09-18 09:00 UTC'.

Someone should redo this test as this interval is the import of the
compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools.  I can't
believe this to be a compiler problem.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Btw.: my test script is:

#! /bin/sh

mdconfig /dev/md0d 2048000 &
P=${!}
newfs /dev/rmd0a
mount -t ffs -o log /dev/md0a /mnt

(cd /mnt && time sh -c 'seq 1 3|xargs touch')

umount /mnt
kill ${P}


RiscOS FILECORE disk image needed

2014-08-18 Thread J. Hannken-Illjes
Subject says it all:

I'm looking for a RiscOS FILECORE disk image that is mountable and
readable on NetBSD with

vnconfig -rc vnd0 image
mount -r -t filecore /dev/vnd0d /mnt
ls -la /mnt
...

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: tmpfs lock error panic with mknod+S_IFMT

2014-05-26 Thread J. Hannken-Illjes
On 26 May 2014, at 16:25, Nicolas Joly nj...@pasteur.fr wrote:

 
 Hi,
 
 While testing some linux binary, i encountered a reproductible lock
 error when the program issued the following mknod call on a tmpfs
 mount :
 
   mknod(dummy, S_IFMT|0666, 0);

You try to create a bad sector on tmpfs -- see do_sys_mknodat() for
details.  More analysis when cvs.netbsd.org is back again.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: i386 ddb trace stopped working with gcc48

2014-04-22 Thread J. Hannken-Illjes
On 21 Apr 2014, at 19:34, Christos Zoulas chris...@astron.com wrote:

 In article 910f0bed-9fd7-4c06-b886-e91002d03...@eis.cs.tu-bs.de,
 J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 On 21 Apr 2014, at 14:39, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:
 
 Since i386 switched to gcc48 ddb trace no longer works:
 
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 eip c02802f4 cs 8 eflags 202 cr2 bbbab0c4 ilevel 8 esp 800
 curlwp 0xc5a9fd20 pid 0 lid 2 lowest kstack 0xdd3b22c0
 Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
 db{0}> bt
 
 breakpoint(c0e661c0,3f8,0,0,c61c5158,c170dacc,c6188000,c5f396c0,c5f39748,dd3b4edc)
  at netbsd:breakpoint+0x4
 
 That's all, never get more than one line.
 The i386_frame from %ebp = dd25ef30 looks like:
 
 dd25ef30:   7ff       <= should be the previous frame
 dd25ef34:   c0277cc1  <= comintr+0x53e (caller of breakpoint)
 dd25ef38:   c0e661c0
 
 Ideas anyone?
 
 Some further notes:
 
 - The function prologue has changed as
 
  -push   %ebp
  -mov    %esp,%ebp
   sub    $0x14,%esp
 
   call   ...
  -leave  
  +add    $0x14,%esp
   ret
 
 - With -fno-omit-frame-pointer all is well.
 
 Perhaps the default changes to -fomit-frame-pointer... We should consider
 changing it back like we did for amd64.

Do we really want to add

  makeoptions COPTS=... -fno-omit-frame-pointer

to all i386 kernel configs like amd64 does or should it better
go to sys/arch/i386/conf/Makefile.i386?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



i386 ddb trace stopped working with gcc48

2014-04-21 Thread J. Hannken-Illjes
Since i386 switched to gcc48 ddb trace no longer works:

fatal breakpoint trap in supervisor mode
trap type 1 code 0 eip c02802f4 cs 8 eflags 202 cr2 bbbab0c4 ilevel 8 esp 800
curlwp 0xc5a9fd20 pid 0 lid 2 lowest kstack 0xdd3b22c0
Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
db{0}> bt
breakpoint(c0e661c0,3f8,0,0,c61c5158,c170dacc,c6188000,c5f396c0,c5f39748,dd3b4edc)
 at netbsd:breakpoint+0x4

That's all, never get more than one line.
The i386_frame from %ebp = dd25ef30 looks like:

dd25ef30:   7ff       <= should be the previous frame
dd25ef34:   c0277cc1  <= comintr+0x53e (caller of breakpoint)
dd25ef38:   c0e661c0

Ideas anyone?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: i386 ddb trace stopped working with gcc48

2014-04-21 Thread J. Hannken-Illjes
On 21 Apr 2014, at 14:39, J. Hannken-Illjes hann...@eis.cs.tu-bs.de wrote:

 Since i386 switched to gcc48 ddb trace no longer works:
 
 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 eip c02802f4 cs 8 eflags 202 cr2 bbbab0c4 ilevel 8 esp 800
 curlwp 0xc5a9fd20 pid 0 lid 2 lowest kstack 0xdd3b22c0
 Stopped in pid 0.2 (system) at  netbsd:breakpoint+0x4:  popl    %ebp
 db{0}> bt
 breakpoint(c0e661c0,3f8,0,0,c61c5158,c170dacc,c6188000,c5f396c0,c5f39748,dd3b4edc)
  at netbsd:breakpoint+0x4
 
 That's all, never get more than one line.
 The i386_frame from %ebp = dd25ef30 looks like:
 
 dd25ef30:   7ff       <= should be the previous frame
 dd25ef34:   c0277cc1  <= comintr+0x53e (caller of breakpoint)
 dd25ef38:   c0e661c0
 
 Ideas anyone?

Some further notes:

- The function prologue has changed as

-push   %ebp
-mov    %esp,%ebp
 sub    $0x14,%esp

 call   ...
-leave  
+add    $0x14,%esp
 ret

- With -fno-omit-frame-pointer all is well.
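
Why the missing push/mov pair matters can be sketched in C. A frame-pointer based tracer follows a chain of saved frame pointers and gives up as soon as a slot does not point into the stack. The code below is a simplified model under assumptions: struct frame and walk_frames() are hypothetical and this is not the ddb implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame layout left behind by "push %ebp; mov %esp,%ebp". */
struct frame {
	struct frame *prev;   /* saved frame pointer of the caller */
	uintptr_t ret;        /* return address into the caller */
};

/*
 * Collect up to max return addresses, stopping as soon as the frame
 * pointer leaves the stack bounds [lo, hi).
 */
static size_t
walk_frames(const struct frame *fp, uintptr_t lo, uintptr_t hi,
    uintptr_t *out, size_t max)
{
	size_t n = 0;

	while (n < max) {
		uintptr_t p = (uintptr_t)fp;

		if (p < lo || p >= hi || hi - p < sizeof(*fp))
			break;            /* not a plausible frame, stop here */
		out[n++] = fp->ret;
		fp = fp->prev;
	}
	return n;
}
```

When a function skips the prologue, the slot the tracer reads as "previous frame" holds whatever happened to be there (7ff in the dump above), and the walk ends after a single entry.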

Does it ring any bell?

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Cannot execute elf binary ...

2013-12-21 Thread J. Hannken-Illjes
On Dec 21, 2013, at 6:21 PM, Kurt Schreiner k...@ub.uni-mainz.de wrote:

 a kernel compiled from source cvs updated some minutes ago can't exec elf
 binaries anymore; seen on i386 and arm, screenshot of i386-VM attached.


Just revert src/sys/kern/exec_elf.c to 1.51.

Running strnlen() on a fresh allocated memory region looks strange.
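
To illustrate the concern in general terms: a length can only be measured on memory that already holds data, so the bounded scan belongs on the source and not on a just-allocated destination. A user-space sketch, with dup_bounded() as a hypothetical helper unrelated to the actual exec_elf.c code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate a string that must be NUL-terminated within maxlen bytes.
 * Returns NULL if no terminator is found or allocation fails.
 */
static char *
dup_bounded(const char *src, size_t maxlen)
{
	size_t len = strnlen(src, maxlen);   /* scan the initialized source */
	char *dst;

	if (len == maxlen)
		return NULL;                 /* not terminated within bounds */
	dst = malloc(len + 1);               /* contents indeterminate here */
	if (dst == NULL)
		return NULL;
	memcpy(dst, src, len + 1);           /* now dst holds a valid string */
	return dst;
}
```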

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)