Hello all,

Since I started following the releng/13.0 branch, I have been
experiencing stalled disk I/O quite often (about once per minute) while
building packages with poudriere.

What I can see in this case is the CPU going almost idle, and several
processes shown in `top` in state "zfs te" (procstat shows "zfs tear"
for them). For up to several seconds, no disk I/O completes (even
starting a new process is impossible), then it recovers. Only twice have
I seen the system go into a deadlock instead, printing messages similar
to this on the serial console:

  swap_pager: indefinite wait buffer ...
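
To put numbers on how often threads sit in the "zfs tear" state
described above, the procstat output could be polled and filtered. The
following is only a minimal sketch, not something that was run for this
report: the names `stalled_lines` and `poll` are made up for
illustration, and the `procstat -a -t` invocation is an assumption that
may need adjusting to whatever shows the wait channel on a given system.

```python
import subprocess
import time

# Wait channel text as procstat displays it, per the description above.
MARKER = "zfs tear"


def stalled_lines(output, marker=MARKER):
    """Return the lines of a thread listing that mention the marker."""
    return [line for line in output.splitlines() if marker in line]


def poll(interval=1.0, command=("procstat", "-a", "-t")):
    """Print a timestamped count whenever any thread shows the marker.

    The command tuple is an assumption; replace it with whatever
    invocation displays the wait channel on the system in question.
    """
    while True:
        result = subprocess.run(command, capture_output=True, text=True)
        hits = stalled_lines(result.stdout)
        if hits:
            stamp = time.strftime("%H:%M:%S")
            print(stamp, len(hits), "thread(s) in", MARKER)
        time.sleep(interval)
```

Calling `poll()` starts the loop. Since new processes cannot start
during a stall, the procstat call may itself block until I/O recovers,
so the timestamps would mark the end of a stall rather than its start.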

I have seen this behavior since -RC3 (I have kept following releng/13.0
up to -RELEASE). Before that, I had the vnlru-related problem that was
fixed with faa41af1fed350327cc542cb240ca2c6e1e8ba0c.

Some details:

* CPU: Intel(R) Xeon(R) CPU E3-1240L v5 @ 2.10GHz
* RAM: 64GB (ECC)
* Four HDDs (Seagate NAS models), 4TB each
* Swap 16GB, striped over the 4 disks
* Pool: 12TB raid-z on GELI-encrypted partitions. NOT upgraded yet, so I
  have a way back to 12.2.
* Two bhyve VMs running with 1GB and 8GB RAM, both wired
* Several jails running services like samba, an MTA, nginx...
* Several NFS shares mounted by other machines
* Poudriere running on idprio 22 with 8 parallel build jobs

Reducing the number of parallel jobs in poudriere also reduces the
frequency of the problem, but it doesn't seem to go away completely. I
also have the impression that these stalls are more likely when a lot of
compilation jobs can be satisfied from ccache.

Thanks for any ideas and insight (e.g. what this "zfs tear" status
means).

Best regards,
Felix Palmen

-- 
 Dipl.-Inform. Felix Palmen  <fe...@palmen-it.de>   ,.//..........
 {web}  http://palmen-it.de  {jabber} [see email]   ,//palmen-it.de
 {pgp public key}     http://palmen-it.de/pub.txt   //   """""""""""
 {pgp fingerprint} A891 3D55 5F2E 3A74 3965 B997 3EF2 8B0A BC02 DA2A
