daily CVS update output

2023-11-01 Thread NetBSD source update


Updating src tree:
P src/distrib/sets/lists/dtb/ad.earmv7
P src/distrib/sets/lists/dtb/ad.earmv7eb
P src/doc/3RDPARTY
P src/sys/dev/ic/dwc_eqos.c

Updating xsrc tree:


Killing core files:




Updating file list:
-rw-rw-r--  1 srcmastr  netbsd  42895853 Nov  2 03:03 ls-lRA.gz


rge(4) completely hangs

2023-11-01 Thread Thomas Klausner
Hi!

After the latest fixes, rge(4) is better, but it has completely hung
the network interface twice so far - no network traffic was possible
on it - and both times so hard that the BIOS had some kind of issue on
the next boot and needed 15 minutes to sort itself out (before even
showing anything on the screen).

I'm running a kernel from Oct 22.

In /var/log/messages I see:
Nov  1 18:59:43 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 18:59:43 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 18:59:45 exadelic dhcpcd[2191]: rge0: Router Advertisement from ::1
Nov  1 18:59:46 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 18:59:46 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 18:59:58 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:01:11 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov  1 19:01:19 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 19:01:19 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 19:01:31 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:01:57 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov  1 19:02:05 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 19:02:05 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 19:02:17 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:04:27 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov  1 19:04:35 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 19:04:35 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 19:04:47 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:06:12 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov  1 19:06:20 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 19:06:21 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 19:06:33 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:09:27 exadelic /netbsd: [ 91537.5847758] nfs server 192.168.178.19:/path: not responding
Nov  1 19:15:16 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov  1 19:15:24 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 19:15:24 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 19:15:36 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:16:51 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov  1 19:16:52 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available
Nov  1 19:16:52 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available
Nov  1 19:16:59 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov  1 19:16:59 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov  1 19:16:59 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available
Nov  1 19:17:11 exadelic syslogd[2290]: last message repeated 3 times
Nov  1 19:17:11 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov  1 19:17:44 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available

Just in case it matters, I'm not running with default sysctls; I have

kern.sbmax: 262144 -> 16777216
net.inet.tcp.recvbuf_max: 262144 -> 16777216
net.inet.tcp.sendbuf_max: 262144 -> 16777216
net.inet.tcp.recvspace: 32768 -> 262144
net.inet.tcp.sendspace: 32768 -> 262144

because of

https://mail-index.netbsd.org/current-users/2017/09/21/msg032369.html
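
For reference, overrides like these are normally applied at runtime with
sysctl(8) and made persistent in /etc/sysctl.conf (the values below are
simply the ones quoted above, not a recommendation):

```shell
# Apply at runtime (as root); same values as quoted above.
sysctl -w kern.sbmax=16777216
sysctl -w net.inet.tcp.recvbuf_max=16777216
sysctl -w net.inet.tcp.sendbuf_max=16777216
sysctl -w net.inet.tcp.recvspace=262144
sysctl -w net.inet.tcp.sendspace=262144

# Or persist across reboots by adding the same settings,
# one per line, to /etc/sysctl.conf:
#   kern.sbmax=16777216
#   net.inet.tcp.recvbuf_max=16777216
#   ...
```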

I've now switched to an ure(4) device.

Has anyone else seen this?
 Thomas


Re: file-backed cgd backup question

2023-11-01 Thread Greg Troxel
Paul Ripke  writes:

>> #!/bin/sh
>> 
>> dd if=/dev/zero of=VND bs=1m count=1
>> cat VND > VND.000
>> vnconfig vnd0 VND
>> cat VND > VND.001
>> newfs /dev/rvnd0a
>> cat VND > VND.002
>> vnconfig -u vnd0
>> cat VND > VND.003
>
> At least this DTRT:
>
> dd if=VND of=VND.004 iflag=direct

That (and thanks for the AIX note) is interesting, but I don't see that
it makes backup programs work right.

It seems that the current man page caution about consistency is correct,
and that it would be great if someone added a cache invalidate on close.
(It's a little scary to touch this code in terms of how much it could
mess people up.)
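
For concreteness, the invalidate-on-close idea might look roughly like
this in vnd's destroy path. This is an untested sketch only: the field
names (sc_vp, sc_cred) are taken from a quick read of sys/dev/vnd.c, and
error handling is elided.

```c
/*
 * Untested sketch: flush dirty buffers and invalidate cached pages of
 * the backing vnode when the unit is destroyed (VNDIOCCLR path), so a
 * later read of the backing file cannot return stale data.
 */
vn_lock(vnd->sc_vp, LK_EXCLUSIVE | LK_RETRY);
error = vinvalbuf(vnd->sc_vp, V_SAVE, vnd->sc_cred, curlwp, 0, 0);
VOP_UNLOCK(vnd->sc_vp);
```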


Re: random lockups (now suspecting zfs)

2023-11-01 Thread Greg Troxel
Paul Ripke  writes:

> On Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
>> A different machine has locked up, running recent netbsd-10.  I was
>> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
>> physical.  It has a private patch to reduce the amount of memory used
>> for ARC, which has been working well.

I have had an additional lockup each on my main machine and my xen/pkg
machine.

>> All 3 tmux windows show something like
>> 
>>   [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 6396k
>> 
>> and I can switch among them and ^T, but trying to run top is stuck (in
>> flt_noram5).  I'll give it an hour or so, and have a look at the
>> console.
>
> Curious - do you have swap configured? On what kind of device?
> I'm wondering if a pageout is wedged waiting for memory...

I do have swap configured.

  wd0 at atabus0 drive 0
  wd0: 
  wd0: drive supports 1-sector PIO transfers, LBA48 addressing
  wd0: 3726 GB, 7752021 cyl, 16 head, 63 sec, 512 bytes/sect x 7814037168 sectors
  wd0: GPT GUID: 7f026840-bd44-4063-be7c-647727ac10d6
  dk2 at wd0: "GDT-3276-4/swap", 83886080 blocks at 4458496, type: swap
  root on dk1 dumps on dk2
  Device  1024-blocks UsedAvail Capacity  Priority
  /dev/dk2   419430400 41943040 0%0

  wd0 at atabus0 drive 0
  wd0: 
  wd0: drive supports 1-sector PIO transfers, LBA48 addressing
  wd0: 953 GB, 1984533 cyl, 16 head, 63 sec, 512 bytes/sect x 2000409264 sectors
  Device  1024-blocks UsedAvail Capacity  Priority
  /dev/wd0b  1677765649384 16728272 0%0

The first is a GPT partition mounted by NAME, and the second is a
disklabel partition.  The first machine I don't expect to really swap,
and the second definitely has memory pressure.  Interestingly, none of
the xen domUs have locked up, meaning I've never found them wedged
while the dom0 was ok.

So to me this feels like a locking botch in a rare path in zfs.




Re: file-backed cgd backup question

2023-11-01 Thread Paul Ripke
On Sun, Oct 22, 2023 at 09:40:53AM -0400, Greg Troxel wrote:
> mlel...@serpens.de (Michael van Elst) writes:
> 
> > g...@lexort.com (Greg Troxel) writes:
> >
> >>> vnd opens the backing file when the unit is created and closes
> >>> the backing file when the unit is destroyed. Then you can access
> >>> the file again.
> >
> >>Is there a guarantee of cache consistency for writes before and reads
> >>after?
> >
> > Before the unit is created you can access the file and after the
> > unit is destroyed you can access the file. That's always safe.
> 
> Sorry if I'm failing to understand something obvious, but with a caching
> layer that has file contents, how are the cache contents invalidated?
> 
> Specifically (but loosely in commands)
> 
>   let's assume the vnd is small and there is a lot of RAM available
> 
>   process opens the file and reads it
> 
>   vnconfig
> 
>   mount vnd0 /mnt
> 
>   date > /mnt/somefile
> 
>   umount /mnt
> 
>   vnconfig -u
> 
>   process opens the file and reads it
> 
> Without fs cache invalidation, stale data can be returned.
> 
> If there is explicit invalidation, it would be nice to say that
> precisely but I am not understanding that it is there.  Reading vnd.c, I
> don't see any cache invalidation on detach.   The only explicit
> invalidation I find is in setcred from VNDIOCSET.
> 
> I guess that prevents the above, but doesn't prevent
> 
>   vnconfig
> 
>   mount
> 
>   read backing file
> 
>   write to mount
> 
>   unmount
> 
>   detach
> 
>   read backing file
> 
> so maybe we need a vinvalbuf on detach?
> 
> > I also think that when the unit is configured but not opened
> > (by device access or mounts) it is safe to access the file.
> 
> As I read the code, reads are ok but will leave possibly stale data in
> the cache for post-close.
> 
> >>> The data is written directly to the allocated blocks of the file.
> >>> So exclusively opening  the backing file _or_ the vnd unit should
> >>> also be safe. But that's not much different from accessing any file
> >>> concurrently, which also leads to "corrupt", inconsistent backups.
> >
> >>That's a different kind of corrupt.
> >
> > Yes, but in the end it's the same, the "backup" isn't usable.
> 
> I am expecting that after deconfiguring, a read of the entire file is
> guaranteed consistent, but I think we're missing invalidate on close.
> 
> > You cannot access the backing file to get a consistent state of the
> > data while a unit is in use. And that's independent of how vnd accesses
> > the bits.
> 
> Agreed; that's more or less like using a backup program on database
> files while the database is running.
> 
> > N.B. if you want to talk about dangers, think about fdiscard(). I
> > doubt that it is safe in the context of the vnd optimization.
> 
> It seems clear that pretty much any file operations are unsafe while the
> vnd is configured.  That seems like an entirely reasonable situation and
> if that's the rule, easy to document.
> 
> I wrote a test script and it shows that stale reads happen.  When I run
> this on UFS2 (netbsd-10), I find that all 4 files are all zero.  When I
> run it on zfs (also netbsd-10), I find that 000 and 001 are all zero and
> 002 and 003 are the same.  (I am guessing that zfs doesn't use the
> direct operations, or caches differently; here I haven't the slightest
> idea what is happening.)
> 
> 10 minutes later, reading VND is still all zeros.  With a new vnconfig,
> it still reads as all zeros.
> 
> 
> #!/bin/sh
> 
> dd if=/dev/zero of=VND bs=1m count=1
> cat VND > VND.000
> vnconfig vnd0 VND
> cat VND > VND.001
> newfs /dev/rvnd0a
> cat VND > VND.002
> vnconfig -u vnd0
> cat VND > VND.003

At least this DTRT:

dd if=VND of=VND.004 iflag=direct

This reminds me of the way AIX handled O_DIRECT vs mmap, etc. From
https://www.ibm.com/docs/en/aix/7.2?topic=tuning-direct-io:

 To avoid consistency issues, if there are multiple calls to open a file
 and one or more of the calls did not specify O_DIRECT and another open
 specified O_DIRECT, the file stays in the normal cached I/O mode.
 Similarly, if the file is mapped into memory through the shmat() or
 mmap() system calls, it stays in normal cached mode. If the last
 conflicting, non-direct access is eliminated, then the file system will
 move the file into direct I/O mode (either by using the close(),
 munmap(), or shmdt() subroutines). Changing from normal mode to direct
 I/O mode can be expensive because all modified pages in memory will have
 to be flushed to disk at that point.

An elegant, albeit complex, solution.
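
To make the stale-read mechanism in the script above easy to see in
isolation, here is a toy model - plain Python, purely illustrative,
nothing NetBSD-specific: a cache that is not invalidated when writes
bypass it keeps serving old data until an explicit invalidate, which is
the invalidate-on-detach argument in a nutshell.

```python
class CachedFile:
    """Toy model of a page cache over a backing store.

    Reads are served from the cache once populated; writes that bypass
    the cache (like vnd's direct writes to the backing file's blocks)
    leave it stale until an explicit invalidate."""

    def __init__(self, data):
        self.backing = bytearray(data)
        self.cache = None

    def cached_read(self):
        if self.cache is None:           # populate cache on first read
            self.cache = bytes(self.backing)
        return self.cache

    def direct_write(self, data):
        self.backing[:] = data           # bypasses the cache entirely

    def invalidate(self):
        self.cache = None                # the proposed detach-time step


f = CachedFile(b"\x00" * 1024)           # dd if=/dev/zero of=VND
f.cached_read()                          # cat VND > VND.000 caches zeros
f.direct_write(b"\x01" * 1024)           # newfs writes via the vnd
assert f.cached_read() == b"\x00" * 1024 # still the old zeros: stale
f.invalidate()
assert f.cached_read() == b"\x01" * 1024 # fresh after invalidation
```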

-- 
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.


Re: random lockups

2023-11-01 Thread Paul Ripke
On Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
> A different machine has locked up, running recent netbsd-10.  I was
> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
> physical.  It has a private patch to reduce the amount of memory used
> for ARC, which has been working well.
> 
> All 3 tmux windows show something like
> 
>   [ 373598.5266510] load: 0.00  cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 6396k
> 
> and I can switch among them and ^T, but trying to run top is stuck (in
> flt_noram5).  I'll give it an hour or so, and have a look at the
> console.

Curious - do you have swap configured? On what kind of device?
I'm wondering if a pageout is wedged waiting for memory...

-- 
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.


Re: weird hangs in current (ghc, gnucash)

2023-11-01 Thread Thomas Klausner
Should we back out ad's changes until he has time to look at them?
 Thomas


On Wed, Nov 01, 2023 at 09:36:01AM +, Chavdar Ivanov wrote:
> This weird hang still takes place on
> 
> ❯ uname -a
> NetBSD ymir.lorien.lan 10.99.10 NetBSD 10.99.10 (GENERIC) #13: Mon Oct
> 30 19:45:39 GMT 2023
> sysbu...@ymir.lorien.lan:/dumps/sysbuild/amd64/obj/home/sysbuild/src/sys/arch/amd64/compile/GENERIC amd64
> 
> - again during building a haskell package:
> 
> ===> Configuring for hs-tagged-0.8.8
> [1 of 2] Compiling Main ( Setup.lhs, Setup.o )
> 
> 
> Htop gives weird output for the not-yet-created process:
> 
> 11506 root63   0 33283   873 S   0.0  0.0  0:00.00 |  `- make
> 20458 root62   0 34832   613 S   0.0  0.0  0:00.00 |  `-
> /bin/sh -c set -e; test -n "" && echo 1>&2 "ERROR:"  && exit
> 1;  exec 3<&0;??? whil
> 24942 root63   0 33296   882 S   0.0  0.0  0:00.00 |  `-
> /usr/bin/make _MAKE=/usr/bin/make OPSYS=NetBSD OS_VERSION=10.99.10
> OPSYS_VERSION=109910 LOWE
> 21643 root58   0 34302   606 S   0.0  0.0  0:00.00 |  `-
> /bin/sh -c set -e;? if test -n "" &&  /usr/pkg/sbin/pkg_info -K
> /usr/pkg/pkgdb -qe hs
> 19149 root63   0 34367   920 S   0.0  0.0  0:00.00 |  `-
> /usr/bin/make LOWER_OPSYS=netbsd _PKGSRC_BARRIER=yes
> ALLOW_VULNERABLE_PACKAGES= reinst
> 23303 root58   0 33685   603 S   0.0  0.0  0:00.00 |   `-
> /bin/sh -c set -e; ulimit -d `ulimit -H -d`; ulimit -v `ulimit -H -v`;
> cd /usr/pkgs
> 27078 root21   0  256G 37735 S   0.0  0.9  0:00.00 | `-
> /usr/pkg/lib/ghc-9.6.3/bin/./ghc-9.6.3 -B/usr/pkg/lib/ghc-9.6.3/lib
> -package-env
> 22058 root   -22   0 0 0 Z   0.0  0.0  0:00.00 |
> `- gcc   <==
> ---
> 
> 
> I guess it is back to the kernel from the 9th of October.
> 
> Chavdar
> 
> -
> 
> On Mon, 23 Oct 2023 at 09:27, Chavdar Ivanov  wrote:
> >
> > I can confirm that after reverting to the kernel from 9th of October 
> > devel/happy builds OK.
> >
> > On Mon, 23 Oct 2023 at 05:56, Markus Kilbinger  wrote:
> >>
> >> ... and probably
> >>
> >> 3. PR kern/57660
> >> https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=57660
> >>
> >> Markus
> >>
> >> Am So., 22. Okt. 2023 um 23:10 Uhr schrieb Thomas Klausner 
> >> :
> >> >
> >> > On Sun, Oct 22, 2023 at 11:06:25PM +0200, Thomas Klausner wrote:
> >> > > On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
> >> > > > I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to 
> >> > > > Oct
> >> > > > 20) to test the rge(4) changes, and started a bulk build, and the
> >> > > > packages using ghc seem to wait for something and make no progress.
> >> > > ...
> >> > > > I see one other new weird behaviour on that machine - gnucash doesn't
> >> > > > finish starting up.
> >> > >
> >> > > I've backed out ad's changes from the 13th, and both problems are gone.
> >> > >
> >> > > I'll attach my local change.
> >> > >
> >> > > Andrew, can you please take a look?
> >> >
> >> > Two test cases to see the problem I have:
> >> >
> >> > 1. start gnucash, it doesn't finish starting up, the splash screen hangs.
> >> >
> >> > 2. cd /usr/pkgsrc/devel/hs-data-array-byte && make
> >> >The 'build' step has two parts, it hangs after the first one.
> >> >
> >> >  Thomas
> >
> >
> >
> > --
> > 
> 
> 
> 
> -- 
> 


Re: weird hangs in current (ghc, gnucash)

2023-11-01 Thread Chavdar Ivanov
This weird hang still takes place on

❯ uname -a
NetBSD ymir.lorien.lan 10.99.10 NetBSD 10.99.10 (GENERIC) #13: Mon Oct
30 19:45:39 GMT 2023
sysbu...@ymir.lorien.lan:/dumps/sysbuild/amd64/obj/home/sysbuild/src/sys/arch/amd64/compile/GENERIC amd64

- again during building a haskell package:

===> Configuring for hs-tagged-0.8.8
[1 of 2] Compiling Main ( Setup.lhs, Setup.o )


Htop gives weird output for the not-yet-created process:

11506 root63   0 33283   873 S   0.0  0.0  0:00.00 |  `- make
20458 root62   0 34832   613 S   0.0  0.0  0:00.00 |  `-
/bin/sh -c set -e; test -n "" && echo 1>&2 "ERROR:"  && exit
1;  exec 3<&0;??? whil
24942 root63   0 33296   882 S   0.0  0.0  0:00.00 |  `-
/usr/bin/make _MAKE=/usr/bin/make OPSYS=NetBSD OS_VERSION=10.99.10
OPSYS_VERSION=109910 LOWE
21643 root58   0 34302   606 S   0.0  0.0  0:00.00 |  `-
/bin/sh -c set -e;? if test -n "" &&  /usr/pkg/sbin/pkg_info -K
/usr/pkg/pkgdb -qe hs
19149 root63   0 34367   920 S   0.0  0.0  0:00.00 |  `-
/usr/bin/make LOWER_OPSYS=netbsd _PKGSRC_BARRIER=yes
ALLOW_VULNERABLE_PACKAGES= reinst
23303 root58   0 33685   603 S   0.0  0.0  0:00.00 |   `-
/bin/sh -c set -e; ulimit -d `ulimit -H -d`; ulimit -v `ulimit -H -v`;
cd /usr/pkgs
27078 root21   0  256G 37735 S   0.0  0.9  0:00.00 | `-
/usr/pkg/lib/ghc-9.6.3/bin/./ghc-9.6.3 -B/usr/pkg/lib/ghc-9.6.3/lib
-package-env
22058 root   -22   0 0 0 Z   0.0  0.0  0:00.00 |
`- gcc   <==
---


I guess it is back to the kernel from the 9th of October.

Chavdar

-

On Mon, 23 Oct 2023 at 09:27, Chavdar Ivanov  wrote:
>
> I can confirm that after reverting to the kernel from 9th of October 
> devel/happy builds OK.
>
> On Mon, 23 Oct 2023 at 05:56, Markus Kilbinger  wrote:
>>
>> ... and probably
>>
>> 3. PR kern/57660
>> https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=57660
>>
>> Markus
>>
>> Am So., 22. Okt. 2023 um 23:10 Uhr schrieb Thomas Klausner :
>> >
>> > On Sun, Oct 22, 2023 at 11:06:25PM +0200, Thomas Klausner wrote:
>> > > On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
>> > > > I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to Oct
>> > > > 20) to test the rge(4) changes, and started a bulk build, and the
>> > > > packages using ghc seem to wait for something and make no progress.
>> > > ...
>> > > > I see one other new weird behaviour on that machine - gnucash doesn't
>> > > > finish starting up.
>> > >
>> > > I've backed out ad's changes from the 13th, and both problems are gone.
>> > >
>> > > I'll attach my local change.
>> > >
>> > > Andrew, can you please take a look?
>> >
>> > Two test cases to see the problem I have:
>> >
>> > 1. start gnucash, it doesn't finish starting up, the splash screen hangs.
>> >
>> > 2. cd /usr/pkgsrc/devel/hs-data-array-byte && make
>> >The 'build' step has two parts, it hangs after the first one.
>> >
>> >  Thomas
>
>
>
> --
> 


