Re: devd out of swap space ? (zfs arc related ?)

2018-04-09 Thread Gerrit Kühn
On Mon, 9 Apr 2018 15:27:45 -0400 Mike Tancsa wrote
about devd out of swap space ? (zfs arc related ?):

> Anyone else seen anything like this on a recent RELENG11 STABLE ?

I think I saw something similar last week with a -stable from
sometime in March. Lots of processes crashed overnight due to "out of
swap space", although there appeared to be plenty of both swap and RAM.
Somehow it looked arc-related to me, but I haven't been able to reproduce
it so far (however, I did not try too hard, either ;-).
This is what top shows on the machine right now:

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 2776K Active, 7231M Inact, 31M Laundry, 24G Wired, 934M Buf, 288M Free
ARC: 20G Total, 6678M MFU, 13G MRU, 1060K Anon, 182M Header, 146M Other
 19G Compressed, 82G Uncompressed, 4.24:1 Ratio



cu
  Gerrit


devd out of swap space ? (zfs arc related ?)

2018-04-09 Thread Mike Tancsa
On one of my internal nfs test boxes, I have noticed that kernels from
March 28th and April 6th r332100 ended up with devd running out of swap
space and being killed at some point.

The first time, I thought it was perhaps a fluke due to some stress testing I
was doing. But I left the box running over the weekend, and on Saturday
morning it died at 3:51am (perhaps when periodic runs?).

I have lots of free memory; however, ARC is chewing up 27G of the 32G.

CPU:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 800K Active, 15M Inact, 9964K Laundry, 30G Wired, 516M Free
ARC: 28G Total, 4996M MFU, 23G MRU, 5440K Anon, 78M Header, 593M Other
 27G Compressed, 28G Uncompressed, 1.04:1 Ratio
Swap: 20G Total, 21M Used, 20G Free
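
For anyone who wants to test whether the ARC is the culprit: one common way to
bound its growth is the loader tunable vfs.zfs.arc_max in /boot/loader.conf.
The value below is only an arbitrary example, not a recommendation from this
thread:

vfs.zfs.arc_max="25769803776"   # cap the ARC at 24 GB (value given in bytes)

The limit can be verified after a reboot with sysctl vfs.zfs.arc_max.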

Anyone else seen anything like this on a recent RELENG11 STABLE ?

---Mike


-- 
---
Mike Tancsa, tel +1 519 651 3400 x203
Sentex Communications, m...@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada


Re: more data: SCHED_ULE+PREEMPTION is the problem

2018-04-09 Thread Stefan Esser
Am 07.04.18 um 16:18 schrieb Peter:
> 3. kern.sched.preempt_thresh
> 
> I could make the problem disappear by changing kern.sched.preempt_thresh  from
> the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This seems to
> correspond to the disk interrupt threads, which run at intr:12 (i5-3570T) or
> intr:8 (p3).

[CC added to include Jeff as the author of the ULE scheduler ...]

Since I had somewhat similar problems on my systems (quad-core CPUs with SMT
enabled, i.e. 8 threads of execution), where compute bound processes kept I/O
intensive processes from running (load average of 12 with 8 "CPUs"), and since
these problems were affected by preempt_thresh, I checked how this variable is
used in the scheduler. The code is in /sys/kern/sched_ule.c.

It controls whether a thread that has become runnable (e.g., after waiting
for disk I/O to complete) will preempt the thread currently running on "this"
CPU (i.e., the one executing this test in the kernel).

IMHO, kern.sched.preempt_thresh should default to a much higher number than 80
(e.g. 190), but maybe I misunderstand some of the details ...


static inline int
sched_shouldpreempt(int pri, int cpri, int remote)
{

The parameters are:

pri: the priority of the now runnable thread
cpri: the priority of the thread that currently runs on "this" CPU
remote: whether to consider preempting a thread on another CPU

The priority values are those displayed by top or ps -l as "PRI", but with an
offset of 100 applied (i.e. pri=120 is displayed as PRI=20 in top).

If the new thread's priority is not better than (i.e. numerically not lower
than) that of the currently executing one (cpri), the currently running thread
will not be preempted:

	/*
	 * If the new priority is not better than the current priority there is
	 * nothing to do.
	 */
	if (pri >= cpri)
		return (0);

If the current thread is the idle thread, it will always be preempted by the
now runnable thread:

	/*
	 * Always preempt idle.
	 */
	if (cpri >= PRI_MIN_IDLE)
		return (1);

A value of preempt_thresh=0 (e.g. if "options PREEMPTION" is missing from the
kernel config) lets the previously running thread continue (except if it was
the idle thread, which has been dealt with above). The compute bound thread may
continue until its quantum has expired.

	/*
	 * If preemption is disabled don't preempt others.
	 */
	if (preempt_thresh == 0)
		return (0);

For any other value of preempt_thresh, the priority of the thread that has just
become runnable is compared to preempt_thresh; if this new priority is higher
than (i.e. a lower numeric value) or equal to preempt_thresh, the thread for
which (e.g.) disk I/O finished will preempt the current thread:

	/*
	 * Preempt if we exceed the threshold.
	 */
	if (pri <= preempt_thresh)
		return (1);

===> This is the only condition that depends on preempt_thresh > 0 <===

The flag "remote" controls whether this thread will be scheduled to run, if
its priority is higher or equal to PRI_MAX_INTERACT (less than or equal to
151) and if the opposite is true for the currently running thread (cpri).
The value of remote will always be 0 on kernels built without "options SMP".

	/*
	 * If we're interactive or better and there is non-interactive
	 * or worse running preempt only remote processors.
	 */
	if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT)
		return (1);


The critical use of preempt_thresh is marked above. If it is 0, no preemption
will occur. On a single processor system, this should allow the CPU bound
thread to run for as long as its quantum lasts.
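
To make the effect of preempt_thresh more tangible, here is a minimal userland
sketch of the decision chain walked through above (the SMP-only "remote" case
is left out). The PRI_MIN_IDLE value and the example priorities (pri=120 for a
woken I/O bound thread, cpri=175 for a running CPU bound thread) are
assumptions chosen only for illustration; the live threshold can be inspected
via the sysctl kern.sched.preempt_thresh.

/* preempt_sketch.c: illustrative only, not the actual kernel code path */
#include <stdio.h>

#define PRI_MIN_IDLE	224	/* assumed value, for illustration only */

static int preempt_thresh;	/* stands in for kern.sched.preempt_thresh */

static int
should_preempt(int pri, int cpri)
{
	if (pri >= cpri)		/* new thread is not better: do nothing */
		return (0);
	if (cpri >= PRI_MIN_IDLE)	/* always preempt the idle thread */
		return (1);
	if (preempt_thresh == 0)	/* preemption disabled */
		return (0);
	if (pri <= preempt_thresh)	/* at or above the threshold: preempt */
		return (1);
	return (0);
}

int
main(void)
{
	/*
	 * An I/O bound thread wakes with pri=120 (PRI=20 in top) while a
	 * CPU bound thread is running with cpri=175 (PRI=75 in top).
	 */
	preempt_thresh = 80;	/* the default */
	printf("thresh= 80: preempt=%d\n", should_preempt(120, 175));
	preempt_thresh = 120;
	printf("thresh=120: preempt=%d\n", should_preempt(120, 175));
	return (0);
}

With the default threshold of 80 this prints preempt=0, i.e. the wakeup does
not preempt the running thread; with a threshold of 120 it prints preempt=1,
which matches the behaviour described in the next paragraph.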

A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound
thread to preempt any other thread with lower priority (cpri > pri). But if a
high priority kernel thread (with a low numeric cpri value) happens to be
active during this test, the I/O bound process will not preempt it.

Whether the I/O bound thread (instead of the compute bound one) will run after
the higher priority thread has given up the CPU depends on which thread the
scheduler selects next. For "timeshare" threads, this will often not be the
higher priority (I/O bound) thread but the compute bound thread, which may then
execute until it is next interrupted by the I/O bound thread (which will not
happen if no new I/O has been requested).

This might explain why setting preempt_thresh to a very low value (in the
range of real-time kernel threads) enforces preemption of the CPU bound
thread, while any higher (numeric) value of preempt_thresh prevents this
and makes tdq_choose() often select the low priority CPU bound thread over the
higher priority I/O bound thread.

BUT the first test in sched_shouldpreempt() should prevent any user process
from ever preempting a real-time thread: "if (pri >= cpri) return (0);".

For preemption to occur, pri must be numerically lower than cpri, and pri
must also be numerically lower than or equal to preempt_thresh.