When I tried to get a core, I saw:

> reboot 0x104
dumping to dev 168,2 (offset=73677660, size=33524130)
dump ahcisata0 port 5: clearing WDCTL_RST failed for drive 0
wddump: device timed out
i/o error

rebooting...

 Thomas

On Fri, Jun 28, 2019 at 11:39:05AM +0200, Thomas Klausner wrote:
> Hi Frank!
>
> I checked some process states in ddb.
>
> "master", the 2 "bjam" and at least one "cp" hanging in tstile have:
> sleepq_block()
> turnstile_block()
> rw_vector_enter()
> genfs_lock()
> VOP_LOCK()
> vn_lock()
> namei_tryemulroot()
> namei()
> check_exec()
> execve_loadvm()
> execve1()
> syscall()
>
> These look quite similar to your backtraces.
>
> The "cp" hanging in biolock has:
> sleepq_block
> cv_timedwait
> bbusy
> getblk
> bio_doread
> ffs_init_vnode
> ffs_newvnode
> vcache_new
> ufs_makeinode
> ufs_create
> VOP_CREATE
> vn_open
> do_open
> do_sys_openat
> sys_open
> syscall
>
> I can't agree with the statement that it's a general -current problem
> -- my current working machine does not have this issue. It "only" has
> 32GB and 12 cores though, and no nvme. dmesg attached.
>
> Do you see the issue on machines without nvme? Just to eliminate that.
> (I wanted to try replacing the nvme boot disk next.)
>  Thomas
>
>
> On Fri, Jun 28, 2019 at 11:20:45AM +0200, Frank Kardel wrote:
> > Hi Thomas,
> >
> > glad that this is observed elsewhere.
> >
> > Maybe following bugs could resonate with your observations:
> >
> > kern/54207 [serious/high]:
> > -current locks up solidly when pkgsrc building
> > adapta-gtk-theme-3.95.0.11
> > looks like locking issue in layerfs* (nullfs). (AMD 1800X, 64GB)
> >
> > kern/54210 [serious/high]:
> > NetBSD-8 processes presumably not exiting
> > not tested with -current,but may be there too. (Intel(R) Xeon(R) Gold 6134
> > CPU @ 3.20GHz, ~380Gb)
> >
> > At this time I am not too confident, that -current is reliably able to do a
> > pkgsrc build, though I have seen occasionally bulk builds that did finish.
> > Most of the time I run into hard lockups with no information about the
> > system state available (no console, no X, no network, no DDB).
> >
> > Frank
> >
> >
> > On 06/28/19 10:46, Thomas Klausner wrote:
> > > Hi!
> > >
> > > I've set up a new machine for bulk building. I have tried various
> > > things, but in the end it always hangs in tstile.
> > >
> > > First try was what I currently use: tmpfs sandboxes with nullfs
> > > mounted /bin, /lib, ... When it hung, the suspicion was that it's
> > > nullfs' fault. (The same setup works fine on my current machine.)
> > >
> > > The second try was tmpfs with copied-in /bin, /lib, ... and
> > > NFS-mounted packages/distfiles/pkgsrc (from localhost). That also
> > > hung. So the suspicion was that tmpfs or NFS are broken.
> > >
> > > The last try was building in the root file system, i.e. not even a
> > > sandbox (chroot). The only tmpfs is in /dev. distfiles/pkgsrc/packages
> > > are on spinning rust, / is on an ld@nvme. With 8 MAKE_JOBS this
> > > finished one pkgsrc build (where some packages didn't build because of
> > > missing distfiles, or because they randomly break like rust). When I
> > > restarted the bulk build with 24 MAKE_JOBS, it hung after ~4 hours.
> > >
> > > I have the following systat output:
> > >
> > > 2 users Load 8.78 7.19 3.62 Fri Jun 28 04:27:32
> > >
> > > Proc:r d s Csw Traps SysCal Intr Soft Fault PAGING SWAPPING
> > > 24 10 7548 265849 157956 3504 2399 265476 in out in out
> > > ops
> > > 56.2% Sy 1.2% Us 0.0% Ni 0.0% In 42.5% Id pages
> > > | | | | | | | | | | |
> > > ============================> 670 forks
> > > fkppw
> > > Anon 294104 % zero 62161268 5572 Interrupts fksvm
> > > Exec 14116 % wired 16296 1968 TLB shootdown pwait
> > > File 24587740 18% inact 43756 100 cpu0 timer relck
> > > Meta 2606694 % bufs 495676 msi1 vec 0 rlkok
> > > (kB) real swaponly free 9 msix2 vec 0 noram
> > > Active 24835908 100033996 9 msix2 vec 1 57262 ndcpy
> > > Namei Sys-cache Proc-cache msix2 vec 2 27906 fltcp
> > > Calls hits % hits % 3427 ioapic1 pin 12 87178 zfod
> > > 125076 122834 98 80 0 59 ioapic2 pin 0 35775 cow
> > > msix7 vec 0 8192 fmin
> > > Disks: seeks xfers bytes %busy 10922 ftarg
> > > ld0 1969 16130K 34.8 itarg
> > > dk0 1969 16130K 34.8 flnan
> > > wd0 pdfre
> > > dk1 pdscn
> > > dk2
> > >
> > > and this from top:
> > >
> > > load averages: 5.13, 6.53, 3.56; up 1+16:08:05 04:28:13
> > > 59 processes: 2 runnable, 55 sleeping, 2 on CPU
> > > CPU states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 99.9% idle
> > > Memory: 24G Act, 43M Inact, 16M Wired, 14M Exec, 23G File, 95G Free
> > > Swap: 163G Total, 163G Free
> > >
> > >   PID USERNAME PRI NICE   SIZE    RES STATE      TIME   WCPU    CPU COMMAND
> > > 10353 pbulk     77    0   185M   172M select/0   0:13  4.74%  4.54% bjam
> > > 12120 wiz      109    0    83M    59M tstile/1 165:46  1.46%  1.46% systat
> > >     0 root       0    0     0K    93M CPU/31    35:39  0.00%  0.00% [system]
> > >   219 root      85    0    32M  2676K kqueue/4   7:34  0.00%  0.00% syslogd
> > > 13354 wiz       85    0    89M  4948K select/0   0:52  0.00%  0.00% sshd
> > >   380 root      85    0    30M    16M pause/4    0:04  0.00%  0.00% ntpd
> > > 10918 wiz       43    0    25M  2872K CPU/3      0:01  0.00%  0.00% top
> > >     1 root      85    0    20M  1756K wait/29    0:01  0.00%  0.00% init
> > >  5594 pbulk      0    0     0K     0K RUN/0      0:00  0.00%  0.00% bjam
> > > 22861 pbulk      0    0     0K     0K RUN/0      0:00  0.00%  0.00% bjam
> > >   747 root     117    0    20M  2080K tstile/8   0:00  0.00%  0.00% cron
> > > 16473 pbulk    117    0    18M  1564K tstile/2   0:00  0.00%  0.00% cp
> > >  9705 pbulk    117    0    15M  1564K bioloc/5   0:00  0.00%  0.00% cp
> > >  7301 pbulk    117    0    15M  1560K tstile/2   0:00  0.00%  0.00% cp
> > > 22971 pbulk    117    0    19M  1520K tstile/1   0:00  0.00%  0.00% cp
> > > 10013 pbulk    117    0    15M  1520K tstile/1   0:00  0.00%  0.00% cp
> > >  3411 pbulk    117    0    15M  1520K tstile/3   0:00  0.00%  0.00% cp
> > >  5212 pbulk    117    0    15M  1520K tstile/2   0:00  0.00%  0.00% cp
> > >  7072 pbulk    117    0    18M  1516K tstile/2   0:00  0.00%  0.00% cp
> > >  8880 pbulk    117    0    15M  1516K tstile/2   0:00  0.00%  0.00% cp
> > >  5869 pbulk    117    0    15M  1516K tstile/0   0:00  0.00%  0.00% cp
> > > 10159 pbulk    117    0    15M  1516K tstile/1   0:00  0.00%  0.00% cp
> > > 11783 pbulk    117    0    15M  1516K tstile/7   0:00  0.00%  0.00% cp
> > >  7205 pbulk    117    0    15M  1512K tstile/1   0:00  0.00%  0.00% cp
> > > 18676 pbulk    109    0    15M  1516K tstile/3   0:00  0.00%  0.00% cp
> > >  7802 pbulk    109    0    15M  1516K tstile/2   0:00  0.00%  0.00% cp
> > >   622 pbulk    109    0    15M  1512K tstile/2   0:00  0.00%  0.00% cp
> > > 29434 pbulk    109    0  9576K   680K tstile/2   0:00  0.00%  0.00% cp
> > >  2686 root      85    0    86M  6824K select/2   0:00  0.00%  0.00% sshd
> > > 10052 root      85    0    89M  6784K select/2   0:00  0.00%  0.00% sshd
> > >   674 root      85    0    70M  5056K wait/18    0:00  0.00%  0.00% login
> > > 19345 wiz       85    0    86M  4960K select/3   0:00  0.00%  0.00% sshd
> > >   652 postfix   85    0    57M  4848K kqueue/4   0:00  0.00%  0.00% qmgr
> > >  4466 postfix   85    0    59M  4560K kqueue/0   0:00  0.00%  0.00% pickup
> > >   441 root      85    0    70M  3412K select/2   0:00  0.00%  0.00% sshd
> > >   656 root      85    0    57M  3328K kqueue/0   0:00  0.00%  0.00% master
> > >   278 root      85    0    45M  2232K nfsd/31    0:00  0.00%  0.00% nfsd
> > >   639 root      85    0    16M  2128K pause/0    0:00  0.00%  0.00% ksh
> > > 21402 root      85    0    20M  1988K wait/0     0:00  0.00%  0.00% sh
> > > 23371 root      85    0    20M  1972K wait/0     0:00  0.00%  0.00% sh
> > >  3940 wiz       85    0    16M  1948K pause/23   0:00  0.00%  0.00% ksh
> > >  8843 wiz       85    0    16M  1948K pause/5    0:00  0.00%  0.00% ksh
> > >   227 root      85    0    20M  1940K select/1   0:00  0.00%  0.00% rpcbind
> > >   698 root      85    0    20M  1836K ttyraw/3   0:00  0.00%  0.00% getty
> > >   542 root      85    0    20M  1832K ttyraw/2   0:00  0.00%  0.00% getty
> > >   535 root      85    0    20M  1832K ttyraw/0   0:00  0.00%  0.00% getty
> > >   531 root      85    0    25M  1644K kqueue/3   0:00  0.00%  0.00% inetd
> > >   329 root      85    0    24M  1524K select/2   0:00  0.00%  0.00% mountd
> > >   436 root      85    0    20M  1516K kqueue/2   0:00  0.00%  0.00% powerd
> > >
> > > On the console I see that it's currently trying to build
> > > boost-headers, so it's not even something compile-heavy.
> > >
> > > The machine is still in this state and I have a PS/2 keyboard
> > > attached, so let me know if you want to check something out.
> > >
> > > I'll attach the dmesg from 8.99.42 (it's currently at 8.99.48).
> > > The kernel config is
> > >
> > > include "arch/amd64/conf/GENERIC"
> > > options FONT_GO_MONO12x23
> > > no options FONT_BOLD16x32
> > > no options FONT_BOLD8x16
> > >
> > > It's a 16-core AMD Threadripper system with 128GB RAM.
> > >
> > > What could go wrong here? I'm running out of ideas.
> > >  Thomas
> > >