Re: Frequent disk I/O stalls while building (poudriere), processes in "zfs tear" state

2021-04-16 Thread Felix Palmen
* Dewayne Geraghty  [20210416 06:26]:
> On 16/04/2021 2:29 am, Felix Palmen wrote:
> > Right now, I'm running a test with idprio 0 instead, which still seems
> > to have the desired effect, and so far I haven't had any of these
> > stalls. If this persists, the problem is solved for me!
> > 
> > I'd still be curious about what the cause might be, and what this state
> > "zfs tear" actually means. But that's kind of an "academic interest"
> > now.
> 
> Most likely your other processes are pre-empting your build, which is
> what you want :).

Yes, that's exactly the plan.

> Use /usr/bin/top to see the priority of the processes (i.e. under the PRI
> column). Using idprio 22 means (on my 12.2-STABLE) a PRI of 146. If
> your kern.sched.preempt_thresh is at the default (80), then
> processes with a PRI of <80 will preempt (for I/O).

I was doing that a lot; that's how I found that those "global" I/O stalls
happened while some processes were in that "zfs tear" state (shown
in top only as "zfs te").
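
For reference, this is roughly how I was watching it (exact flags from
memory):

  # thread view incl. system processes; the STATE column shows the wait channel
  top -SH
  # per-thread state and full wait channel for one process
  procstat -t <pid>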

> Even with an idprio 0, the PRI is 124. So I suspect that was more a
> matter of timing (i.e. good luck).

That seems rather unlikely, because the behavior is quite reproducible.
Having observed builds at idprio 0 (yes, this results in a priority of
124) for a while, I still see processes getting "stuck" for a few
seconds from time to time, mostly ccache processes, but now in state
"zfsvfs", and the rest of the system is not affected; I/O still works.

So, something did change with ZFS and priorities between 12.2 and 13.0.
Running the whole builds on idprio 22 worked fine on 12.2.

> You could increase your preemption threshold for the duration of the
> build, to include your nice value. But... (not really a good idea).

That would clearly defeat the purpose, yes ;)

> Re zfs - sorry, I'm peculiar and don't use it ;)

I suspect the relevant change lies exactly in that context; still,
thanks for answering :) Now that I have a working solution, this is no
longer an important issue for me. Curiosity remains…

-- 
 Dipl.-Inform. Felix Palmen ,.//..
 {web}  http://palmen-it.de  {jabber} [see email]   ,//palmen-it.de
 {pgp public key} http://palmen-it.de/pub.txt   //   """
 {pgp fingerprint} A891 3D55 5F2E 3A74 3965 B997 3EF2 8B0A BC02 DA2A




Re: Frequent disk I/O stalls while building (poudriere), processes in "zfs tear" state

2021-04-15 Thread Dewayne Geraghty
On 16/04/2021 2:29 am, Felix Palmen wrote:
> After more experimentation, I finally found what's causing these
> problems for me on 13:
> 
> * Felix Palmen  [20210412 11:44]:
>> * Poudriere running on idprio 22 with 8 parallel build jobs
> 
> Running poudriere with normal priority works perfectly fine. Now, I've
> had poudriere running on idprio because there are several other services
> on that machine that shouldn't be slowed down by a heavy build, while I
> still want to use all available CPU resources for building.
> 
> Right now, I'm running a test with idprio 0 instead, which still seems
> to have the desired effect, and so far I haven't had any of these
> stalls. If this persists, the problem is solved for me!
> 
> I'd still be curious about what the cause might be, and what this state
> "zfs tear" actually means. But that's kind of an "academic interest"
> now.
> 

Most likely your other processes are pre-empting your build, which is
what you want :).

Use /usr/bin/top to see the priority of the processes (i.e. under the PRI
column). Using idprio 22 means (on my 12.2-STABLE) a PRI of 146. If
your kern.sched.preempt_thresh is at the default (80), then
processes with a PRI of <80 will preempt (for I/O).
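
For example (the numbers will differ per system):

  # PRI and nice value per process
  ps -ax -o pid,pri,nice,comm | head
  # the scheduler's preemption threshold
  sysctl kern.sched.preempt_thresh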

Even with an idprio 0, the PRI is 124. So I suspect that was more a
matter of timing (i.e. good luck).

You could increase your preemption threshold for the duration of the
build, to include your nice value. But... (not really a good idea).
Better to run your build using nice (PRI of 76), which should avoid
the stalls but would also affect your more important services.
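
Something like this (the jail and package list names here are only
placeholders):

  # a niced build instead of an idle-priority one
  nice -n 20 poudriere bulk -j 13amd64 -f /path/to/pkglist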

Re zfs - sorry, I'm peculiar and don't use it ;)


Re: Frequent disk I/O stalls while building (poudriere), processes in "zfs tear" state

2021-04-15 Thread Felix Palmen
After more experimentation, I finally found what's causing these
problems for me on 13:

* Felix Palmen  [20210412 11:44]:
> * Poudriere running on idprio 22 with 8 parallel build jobs

Running poudriere with normal priority works perfectly fine. Now, I've
had poudriere running on idprio because there are several other services
on that machine that shouldn't be slowed down by a heavy build, while I
still want to use all available CPU resources for building.

Right now, I'm running a test with idprio 0 instead, which still seems
to have the desired effect, and so far I haven't had any of these
stalls. If this persists, the problem is solved for me!

I'd still be curious about what the cause might be, and what this state
"zfs tear" actually means. But that's kind of an "academic interest"
now.

-- 
 Dipl.-Inform. Felix Palmen ,.//..
 {web}  http://palmen-it.de  {jabber} [see email]   ,//palmen-it.de
 {pgp public key} http://palmen-it.de/pub.txt   //   """
 {pgp fingerprint} A891 3D55 5F2E 3A74 3965 B997 3EF2 8B0A BC02 DA2A




Frequent disk I/O stalls while building (poudriere), processes in "zfs tear" state

2021-04-12 Thread Felix Palmen
Hello all,

since following the releng/13.0 branch, I have been experiencing stalled
disk I/O quite often (roughly once per minute) while building packages
with poudriere.

What I see in this case is the CPU going almost idle and several
processes shown in `top` in state "zfs te" (procstat shows "zfs tear"
for them). For up to several seconds, no disk I/O completes (even
starting a new process is impossible); then it recovers. Only twice
have I seen the system go into a deadlock instead, printing messages
similar to this on the serial console:

  swap_pager: indefinite wait buffer ...

I have seen this behavior since -RC3 (I followed releng/13.0 up to
-RELEASE). Before that, I had the vnlru-related problem that was fixed
with faa41af1fed350327cc542cb240ca2c6e1e8ba0c.

Some details:

* CPU: Intel(R) Xeon(R) CPU E3-1240L v5 @ 2.10GHz
* RAM: 64GB (ECC)
* Four HDDs (Seagate NAS models), 4TB each
* Swap 16GB, striped over the 4 disks
* Pool: 12TB raid-z on GELI-encrypted partitions. NOT upgraded yet, so I
  have a way back to 12.2.
* Two bhyve VMs running with 1GB and 8GB RAM, both wired
* Several jails running services like samba, an MTA, nginx...
* Several NFS shares mounted by other machines
* Poudriere running on idprio 22 with 8 parallel build jobs
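
For completeness, the build is started roughly like this (the jail and
package list names here are made up; PARALLEL_JOBS=8 is set in
poudriere.conf):

  # idle-priority build with 8 parallel jobs
  idprio 22 poudriere bulk -j 13amd64 -f /usr/local/etc/poudriere.d/pkglist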

Reducing the number of parallel jobs in poudriere also reduces the
frequency of the problem, but it doesn't seem to go away completely.
Also, I have the impression that running into these stalls is more likely
when a lot of compilation jobs can be satisfied from ccache.

Thanks for any ideas and insights (e.g. what this "zfs tear" state
means).

Best regards,
Felix Palmen

-- 
 Dipl.-Inform. Felix Palmen ,.//..
 {web}  http://palmen-it.de  {jabber} [see email]   ,//palmen-it.de
 {pgp public key} http://palmen-it.de/pub.txt   //   """
 {pgp fingerprint} A891 3D55 5F2E 3A74 3965 B997 3EF2 8B0A BC02 DA2A

