daily CVS update output
Updating src tree:
P src/distrib/sets/lists/dtb/ad.earmv7
P src/distrib/sets/lists/dtb/ad.earmv7eb
P src/doc/3RDPARTY
P src/sys/dev/ic/dwc_eqos.c

Updating xsrc tree:

Killing core files:

Updating file list:
-rw-rw-r-- 1 srcmastr netbsd 42895853 Nov 2 03:03 ls-lRA.gz
rge(4) completely hangs
Hi!

After the latest fixes, rge(4) is better, but it has completely hung the network interface twice so far (no network traffic possible on it), both times so hard that the BIOS had some kind of issue on the next boot and needed 15 minutes to sort itself out (before even showing anything on the screen).

I'm running a kernel from Oct 22.

In /var/log/messages I see:

Nov 1 18:59:43 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 18:59:43 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 18:59:45 exadelic dhcpcd[2191]: rge0: Router Advertisement from ::1
Nov 1 18:59:46 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 18:59:46 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 18:59:58 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:01:11 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov 1 19:01:19 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 19:01:19 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 19:01:31 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:01:57 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov 1 19:02:05 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 19:02:05 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 19:02:17 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:04:27 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov 1 19:04:35 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 19:04:35 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 19:04:47 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:06:12 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov 1 19:06:20 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 19:06:21 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 19:06:33 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:09:27 exadelic /netbsd: [ 91537.5847758] nfs server 192.168.178.19:/path: not responding
Nov 1 19:15:16 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov 1 19:15:24 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 19:15:24 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 19:15:36 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:16:51 exadelic dhcpcd[2191]: rge0: ::1 is reachable again
Nov 1 19:16:52 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available
Nov 1 19:16:52 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available
Nov 1 19:16:59 exadelic dhcpcd[2191]: rge0: ::1 is unreachable
Nov 1 19:16:59 exadelic dhcpcd[2191]: rge0: soliciting an IPv6 router
Nov 1 19:16:59 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available
Nov 1 19:17:11 exadelic syslogd[2290]: last message repeated 3 times
Nov 1 19:17:11 exadelic dhcpcd[2191]: rge0: no IPv6 Routers available
Nov 1 19:17:44 exadelic dhcpcd[2191]: ps_root_recvmsg: No buffer space available

Just in case it matters, I'm not running with default sysctls; I have

kern.sbmax: 262144 -> 16777216
net.inet.tcp.recvbuf_max: 262144 -> 16777216
net.inet.tcp.sendbuf_max: 262144 -> 16777216
net.inet.tcp.recvspace: 32768 -> 262144
net.inet.tcp.sendspace: 32768 -> 262144

because of
https://mail-index.netbsd.org/current-users/2017/09/21/msg032369.html

I've now switched to an ure(4) device.

Has anyone else seen this?
 Thomas
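One way to rule the tuning in or out would be to put the five values back to the previous settings shown in the message and see whether the hang recurs. A minimal sketch, assuming those "before" values are the ones to restore; `print_default_sysctls` is a hypothetical helper name, and it only prints the `sysctl -w` commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands that would restore
# the previous values quoted in the message. The name/value pairs are
# copied from the "old -> new" lines above; nothing is changed here.
print_default_sysctls() {
	while read -r name value; do
		echo "sysctl -w $name=$value"
	done <<EOF
kern.sbmax 262144
net.inet.tcp.recvbuf_max 262144
net.inet.tcp.sendbuf_max 262144
net.inet.tcp.recvspace 32768
net.inet.tcp.sendspace 32768
EOF
}

print_default_sysctls
```

The printed commands would have to be run as root on the affected machine; whether the tuning is related to the rge(4) hang at all is an open question here.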
Re: file-backed cgd backup question
Paul Ripke writes:

>> #!/bin/sh
>>
>> dd if=/dev/zero of=VND bs=1m count=1
>> cat VND > VND.000
>> vnconfig vnd0 VND
>> cat VND > VND.001
>> newfs /dev/rvnd0a
>> cat VND > VND.002
>> vnconfig -u vnd0
>> cat VND > VND.003
>
> At least this DTRT:
>
> dd if=VND of=VND.004 iflag=direct

That (and thanks for the AIX note) is interesting, but I don't see that it makes backup programs work right.

It seems that the current man page caution about consistency is correct, and that it would be great if someone added a cache invalidate on close. (It's a little scary to touch this code in terms of how much it could mess people up.)
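The snapshots produced by the quoted script can be inspected mechanically rather than by eye. A minimal sketch, assuming the files VND.000 through VND.004 sit in the current directory; `all_zero` is a hypothetical helper introduced for illustration, not part of any NetBSD tool.

```shell
#!/bin/sh
# all_zero FILE: succeed if FILE contains only NUL bytes (or is empty).
# Hypothetical helper name. The command substitution is deliberately
# unquoted so any leading blanks in the wc output are dropped.
all_zero() {
	[ $(tr -d '\000' < "$1" | wc -c) -eq 0 ]
}

# Report each snapshot that exists; the names follow the test script.
for f in VND.000 VND.001 VND.002 VND.003 VND.004; do
	[ -f "$f" ] || continue
	if all_zero "$f"; then
		echo "$f: all zero"
	else
		echo "$f: contains data"
	fi
done
```

If VND.003 reports all zero while VND.004 (read with iflag=direct) contains data, the cached read after `vnconfig -u` returned stale contents, which matches the behaviour described for UFS2.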
Re: random lockups (now suspecting zfs)
Paul Ripke writes:

> On Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
>> A different machine has locked up, running recent netbsd-10. I was
>> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
>> physical. It has a private patch to reduce the amount of memory used
>> for ARC, which has been working well.

I have had an additional lockup each on my main machine and my xen/pkg machine.

>> All 3 tmux windows show something like
>>
>> [ 373598.5266510] load: 0.00 cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 6396k
>>
>> and I can switch among them and ^T, but trying to run top is stuck (in
>> flt_noram5). I'll give it an hour or so, and have a look at the
>> console.
>
> Curious - do you have swap configured? On what kind of device?
> I'm wondering if a pageout is wedged waiting for memory...

I do have swap configured.

wd0 at atabus0 drive 0
wd0:
wd0: drive supports 1-sector PIO transfers, LBA48 addressing
wd0: 3726 GB, 7752021 cyl, 16 head, 63 sec, 512 bytes/sect x 7814037168 sectors
wd0: GPT GUID: 7f026840-bd44-4063-be7c-647727ac10d6
dk2 at wd0: "GDT-3276-4/swap", 83886080 blocks at 4458496, type: swap
root on dk1 dumps on dk2

Device      1024-blocks     Used    Avail Capacity  Priority
/dev/dk2       41943040        0 41943040     0%    0

wd0 at atabus0 drive 0
wd0:
wd0: drive supports 1-sector PIO transfers, LBA48 addressing
wd0: 953 GB, 1984533 cyl, 16 head, 63 sec, 512 bytes/sect x 2000409264 sectors

Device      1024-blocks     Used    Avail Capacity  Priority
/dev/wd0b      16777656    49384 16728272     0%    0

The first is a GPT partition mounted by NAME, and the second is a disklabel partition. The first machine I don't expect to really swap, and the second definitely has memory pressure.

Interestingly none of the xen domUs have locked up, meaning I've never found them wedged and the dom0 ok.

So to me this feels like a locking botch in a rare path in zfs.
Re: file-backed cgd backup question
On Sun, Oct 22, 2023 at 09:40:53AM -0400, Greg Troxel wrote:
> mlel...@serpens.de (Michael van Elst) writes:
>
> > g...@lexort.com (Greg Troxel) writes:
> >
> >>> vnd opens the backing file when the unit is created and closes
> >>> the backing file when the unit is destroyed. Then you can access
> >>> the file again.
> >
> >>Is there a guarantee of cache consistency for writes before and reads
> >>after?
> >
> > Before the unit is created you can access the file and after the
> > unit is destroyed you can access the file. That's always safe.
>
> Sorry if I'm failing to understand something obvious, but with a caching
> layer that has file contents, how are the cache contents invalidated?
>
> Specifically (but loosely in commands)
>
>   let's assume the vnd is small and there is a lot of RAM available
>
>   process opens the file and reads it
>
>   vnconfig
>
>   mount vnd0 /mnt
>
>   date > /mnt/somefile
>
>   umount /mnt
>
>   vnconfig -u
>
>   process opens the file and reads it
>
> Without fs cache invalidation, stale data can be returned.
>
> If there is explicit invalidation, it would be nice to say that
> precisely but I am not understanding that it is there. Reading vnd.c, I
> don't see any cache invalidation on detach. The only explicit
> invalidation I find is in setcred from VNDIOCSET.
>
> I guess that prevents the above, but doesn't prevent
>
>   vnconfig
>
>   mount
>
>   read backing file
>
>   write to mount
>
>   unmount
>
>   detach
>
>   read backing file
>
> so maybe we need a vinvalbuf on detach?
>
> > I also think that when the unit is configured but not opened
> > (by device access or mounts) it is safe to access the file.
>
> As I read the code, reads are ok but will leave possibly stale data in
> the cache for post-close.
>
> >>> The data is written directly to the allocated blocks of the file.
> >>> So exclusively opening the backing file _or_ the vnd unit should
> >>> also be safe. But that's not much different from accessing any file
> >>> concurrently, which also leads to "corrupt", inconsistent backups.
> >
> >>That's a different kind of corrupt.
> >
> > Yes, but in the end it's the same, the "backup" isn't usable.
>
> I am expecting that after deconfiguring, a read of the entire file is
> guaranteed consistent, but I think we're missing invalidate on close.
>
> > You cannot access the backing file to get a consistent state of the
> > data while a unit is in use. And that's independent of how vnd accesses
> > the bits.
>
> Agreed; that's more or less like using a backup program on database
> files while the database is running.
>
> > N.B. if you want to talk about dangers, think about fdiscard(). I
> > doubt that it is safe in the context of the vnd optimization.
>
> It seems clear that pretty much any file operations are unsafe while the
> vnd is configured. That seems like an entirely reasonable situation and
> if that's the rule, easy to document.
>
> I wrote a test script and it shows that stale reads happen. When I run
> this on UFS2 (netbsd-10), I find that all 4 files are all zero. When I
> run it on zfs (also netbsd-10), I find that 000 and 001 are all zero and
> 002 and 003 are the same. (I am guessing that zfs doesn't use the
> direct operations, or caches differently; here I haven't the slightest
> idea what is happening.)
>
> 10 minutes later, reading VND is still all zeros. With a new vnconfig,
> it still reads as all zeros.
>
>
> #!/bin/sh
>
> dd if=/dev/zero of=VND bs=1m count=1
> cat VND > VND.000
> vnconfig vnd0 VND
> cat VND > VND.001
> newfs /dev/rvnd0a
> cat VND > VND.002
> vnconfig -u vnd0
> cat VND > VND.003

At least this DTRT:

dd if=VND of=VND.004 iflag=direct

This reminds me of the way AIX handled O_DIRECT vs mmap, etc. From
https://www.ibm.com/docs/en/aix/7.2?topic=tuning-direct-io:

  To avoid consistency issues, if there are multiple calls to open a file
  and one or more of the calls did not specify O_DIRECT and another open
  specified O_DIRECT, the file stays in the normal cached I/O mode.
  Similarly, if the file is mapped into memory through the shmat() or
  mmap() system calls, it stays in normal cached mode. If the last
  conflicting, non-direct access is eliminated, then the file system will
  move the file into direct I/O mode (either by using the close(),
  munmap(), or shmdt() subroutines). Changing from normal mode to direct
  I/O mode can be expensive because all modified pages in memory will
  have to be flushed to disk at that point.

An elegant, albeit complex, solution.

--
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.
Re: random lockups
On Fri, Oct 20, 2023 at 01:11:15PM -0400, Greg Troxel wrote:
> A different machine has locked up, running recent netbsd-10. I was
> doing pkgsrc rebuilds in zfs, in a dom0 with 4G of RAM, with 8G total
> physical. It has a private patch to reduce the amount of memory used
> for ARC, which has been working well.
>
> All 3 tmux windows show something like
>
> [ 373598.5266510] load: 0.00 cmd: bash 21965 [flt_noram5] 0.37u 2.89s 0% 6396k
>
> and I can switch among them and ^T, but trying to run top is stuck (in
> flt_noram5). I'll give it an hour or so, and have a look at the
> console.

Curious - do you have swap configured? On what kind of device?
I'm wondering if a pageout is wedged waiting for memory...

--
Paul Ripke
"Great minds discuss ideas, average minds discuss events, small minds
 discuss people."
-- Disputed: Often attributed to Eleanor Roosevelt. 1948.
Re: weird hangs in current (ghc, gnucash)
Should we back out ad's changes until he has time to look at them?
 Thomas

On Wed, Nov 01, 2023 at 09:36:01AM +, Chavdar Ivanov wrote:
> This weird hang still takes place on
>
> ❯ uname -a
> NetBSD ymir.lorien.lan 10.99.10 NetBSD 10.99.10 (GENERIC) #13: Mon Oct 30 19:45:39 GMT 2023
> sysbu...@ymir.lorien.lan:/dumps/sysbuild/amd64/obj/home/sysbuild/src/sys/arch/amd64/compile/GENERIC amd64
>
> - again during building a haskell package:
>
> ===> Configuring for hs-tagged-0.8.8
> [1 of 2] Compiling Main ( Setup.lhs, Setup.o )
>
>
> Htop gives weird output for the process not-yet-created:
>
> 11506 root 63 0 33283 873 S 0.0 0.0 0:00.00 | `- make
> 20458 root 62 0 34832 613 S 0.0 0.0 0:00.00 | `- /bin/sh -c set -e; test -n "" && echo 1>&2 "ERROR:" && exit 1; exec 3<&0;??? whil
> 24942 root 63 0 33296 882 S 0.0 0.0 0:00.00 | `- /usr/bin/make _MAKE=/usr/bin/make OPSYS=NetBSD OS_VERSION=10.99.10 OPSYS_VERSION=109910 LOWE
> 21643 root 58 0 34302 606 S 0.0 0.0 0:00.00 | `- /bin/sh -c set -e;? if test -n "" && /usr/pkg/sbin/pkg_info -K /usr/pkg/pkgdb -qe hs
> 19149 root 63 0 34367 920 S 0.0 0.0 0:00.00 | `- /usr/bin/make LOWER_OPSYS=netbsd _PKGSRC_BARRIER=yes ALLOW_VULNERABLE_PACKAGES= reinst
> 23303 root 58 0 33685 603 S 0.0 0.0 0:00.00 | `- /bin/sh -c set -e; ulimit -d `ulimit -H -d`; ulimit -v `ulimit -H -v`; cd /usr/pkgs
> 27078 root 21 0 256G 37735 S 0.0 0.9 0:00.00 | `- /usr/pkg/lib/ghc-9.6.3/bin/./ghc-9.6.3 -B/usr/pkg/lib/ghc-9.6.3/lib -package-env
> 22058 root -22 0 0 0 Z 0.0 0.0 0:00.00 | `- gcc <==
> ---
>
>
> I guess it is back to the kernel from the 9th of October.
>
> Chavdar
>
> -
>
> On Mon, 23 Oct 2023 at 09:27, Chavdar Ivanov wrote:
> >
> > I can confirm that after reverting to the kernel from 9th of October
> > devel/happy builds OK.
> >
> > On Mon, 23 Oct 2023 at 05:56, Markus Kilbinger wrote:
> >>
> >> ... and probably
> >>
> >> 3. PR kern/57660
> >> https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=57660
> >>
> >> Markus
> >>
> >> On Sun, 22 Oct 2023 at 23:10, Thomas Klausner wrote:
> >> >
> >> > On Sun, Oct 22, 2023 at 11:06:25PM +0200, Thomas Klausner wrote:
> >> > > On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
> >> > > > I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to
> >> > > > Oct 20) to test the rge(4) changes, and started a bulk build, and the
> >> > > > packages using ghc seem to wait for something and make no progress.
> >> > > ...
> >> > > > I see one other new weird behaviour on that machine - gnucash doesn't
> >> > > > finish starting up.
> >> > >
> >> > > I've backed out ad's changes from the 13th, and both problems are gone.
> >> > >
> >> > > I'll attach my local change.
> >> > >
> >> > > Andrew, can you please take a look?
> >> >
> >> > Two test cases to see the problem I have:
> >> >
> >> > 1. start gnucash, it doesn't finish starting up, the splash screen hangs.
> >> >
> >> > 2. cd /usr/pkgsrc/devel/hs-data-array-byte && make
> >> >    The 'build' step has two parts, it hangs after the first one.
> >> >
> >> > Thomas
Re: weird hangs in current (ghc, gnucash)
This weird hang still takes place on

❯ uname -a
NetBSD ymir.lorien.lan 10.99.10 NetBSD 10.99.10 (GENERIC) #13: Mon Oct 30 19:45:39 GMT 2023
sysbu...@ymir.lorien.lan:/dumps/sysbuild/amd64/obj/home/sysbuild/src/sys/arch/amd64/compile/GENERIC amd64

- again during building a haskell package:

===> Configuring for hs-tagged-0.8.8
[1 of 2] Compiling Main ( Setup.lhs, Setup.o )


Htop gives weird output for the process not-yet-created:

11506 root 63 0 33283 873 S 0.0 0.0 0:00.00 | `- make
20458 root 62 0 34832 613 S 0.0 0.0 0:00.00 | `- /bin/sh -c set -e; test -n "" && echo 1>&2 "ERROR:" && exit 1; exec 3<&0;??? whil
24942 root 63 0 33296 882 S 0.0 0.0 0:00.00 | `- /usr/bin/make _MAKE=/usr/bin/make OPSYS=NetBSD OS_VERSION=10.99.10 OPSYS_VERSION=109910 LOWE
21643 root 58 0 34302 606 S 0.0 0.0 0:00.00 | `- /bin/sh -c set -e;? if test -n "" && /usr/pkg/sbin/pkg_info -K /usr/pkg/pkgdb -qe hs
19149 root 63 0 34367 920 S 0.0 0.0 0:00.00 | `- /usr/bin/make LOWER_OPSYS=netbsd _PKGSRC_BARRIER=yes ALLOW_VULNERABLE_PACKAGES= reinst
23303 root 58 0 33685 603 S 0.0 0.0 0:00.00 | `- /bin/sh -c set -e; ulimit -d `ulimit -H -d`; ulimit -v `ulimit -H -v`; cd /usr/pkgs
27078 root 21 0 256G 37735 S 0.0 0.9 0:00.00 | `- /usr/pkg/lib/ghc-9.6.3/bin/./ghc-9.6.3 -B/usr/pkg/lib/ghc-9.6.3/lib -package-env
22058 root -22 0 0 0 Z 0.0 0.0 0:00.00 | `- gcc <==
---


I guess it is back to the kernel from the 9th of October.

Chavdar

-

On Mon, 23 Oct 2023 at 09:27, Chavdar Ivanov wrote:
>
> I can confirm that after reverting to the kernel from 9th of October
> devel/happy builds OK.
>
> On Mon, 23 Oct 2023 at 05:56, Markus Kilbinger wrote:
>>
>> ... and probably
>>
>> 3. PR kern/57660
>> https://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=57660
>>
>> Markus
>>
>> On Sun, 22 Oct 2023 at 23:10, Thomas Klausner wrote:
>> >
>> > On Sun, Oct 22, 2023 at 11:06:25PM +0200, Thomas Klausner wrote:
>> > > On Sun, Oct 22, 2023 at 10:37:54PM +0200, Thomas Klausner wrote:
>> > > > I've just updated my kernel from 10.99.10 to 10.99.10 (~ Oct 11 to Oct
>> > > > 20) to test the rge(4) changes, and started a bulk build, and the
>> > > > packages using ghc seem to wait for something and make no progress.
>> > > ...
>> > > > I see one other new weird behaviour on that machine - gnucash doesn't
>> > > > finish starting up.
>> > >
>> > > I've backed out ad's changes from the 13th, and both problems are gone.
>> > >
>> > > I'll attach my local change.
>> > >
>> > > Andrew, can you please take a look?
>> >
>> > Two test cases to see the problem I have:
>> >
>> > 1. start gnucash, it doesn't finish starting up, the splash screen hangs.
>> >
>> > 2. cd /usr/pkgsrc/devel/hs-data-array-byte && make
>> >    The 'build' step has two parts, it hangs after the first one.
>> >
>> > Thomas