Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-09 Thread Steve Kargl
On Sat, Jun 09, 2018 at 06:07:15PM -0700, Don Lewis wrote:
> On  9 Jun, Stefan Esser wrote:
> 
> > 3) Programs that evenly split the load on all available cores have been
> >    suffering from sub-optimal assignment of threads to cores. E.g. on a
> >    CPU with 8 (virtual) cores, this resulted in 6 cores running the load
> >    in nominal time, 1 core taking twice as long because 2 threads were
> >    scheduled to run on it, while 1 core was mostly idle. Even if the
> >    load was initially evenly distributed, a woken up process that ran on
> >    one core destroyed the symmetry and it was not recovered. (This was a
> >    problem e.g. for parallel programs using MPI or the like.)
> 
> When a core is about to go idle or first enters the idle state it will
> search for the most heavily loaded core and steal a thread from it.  The
> core will only go to sleep if it can't find a non-running thread to
> steal.
> 
> If there are N cores and N+1 runnable threads, there is a long term load
> balancer that runs periodically.  It searches for the most and least
> loaded cores and moves a thread from the former to the latter.  That
> prevents the same pair of threads from having to share the same core
> indefinitely.
> 
> There is an observed bug where a low priority thread can get pinned to a
> particular core that is already occupied by a high-priority CPU-bound
> thread that never releases the CPU.  The low priority thread can't
> migrate to another core that subsequently becomes available because it
> is pinned.  It is not known how the thread originally got into this
> state.  I don't see any reason for 4BSD to be immune to this problem.
> 

It is a well-known problem that an over-subscribed ULE kernel
has much worse performance than a 4BSD kernel.  I've posted
more than once with benchmark numbers that demonstrate the problem.

-- 
Steve
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-09 Thread Don Lewis
On  9 Jun, Stefan Esser wrote:

> 3) Programs that evenly split the load on all available cores have been
>    suffering from sub-optimal assignment of threads to cores. E.g. on a
>    CPU with 8 (virtual) cores, this resulted in 6 cores running the load
>    in nominal time, 1 core taking twice as long because 2 threads were
>    scheduled to run on it, while 1 core was mostly idle. Even if the
>    load was initially evenly distributed, a woken up process that ran on
>    one core destroyed the symmetry and it was not recovered. (This was a
>    problem e.g. for parallel programs using MPI or the like.)

When a core is about to go idle or first enters the idle state it will
search for the most heavily loaded core and steal a thread from it.  The
core will only go to sleep if it can't find a non-running thread to
steal.

If there are N cores and N+1 runnable threads, there is a long term load
balancer that runs periodically.  It searches for the most and least
loaded cores and moves a thread from the former to the latter.  That
prevents the same pair of threads from having to share the same core
indefinitely.
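
As a rough illustration of the two mechanisms just described, here is a toy
model in Python.  It is not the ULE code: the run-queue representation and the
names steal_for_idle and balance_once are made up for this sketch, and "load"
is simply the number of threads on a core.

```python
# Toy model of idle stealing and the periodic balancer described above.
# Each core is a list of thread names; index 0 is the running thread and
# the rest are waiting.  All names are hypothetical, not ULE's.

def steal_for_idle(cores, idle):
    """An idle core takes one waiting thread from the busiest core.

    Returns True if a thread was stolen, False if the core has to sleep.
    """
    busiest = max(range(len(cores)), key=lambda i: len(cores[i]))
    if len(cores[busiest]) < 2:
        return False                      # only running threads exist
    cores[idle].append(cores[busiest].pop())
    return True

def balance_once(cores):
    """Move one waiting thread from the most to the least loaded core."""
    most = max(range(len(cores)), key=lambda i: len(cores[i]))
    least = min(range(len(cores)), key=lambda i: len(cores[i]))
    if len(cores[most]) <= len(cores[least]) or len(cores[most]) < 2:
        return False                      # already balanced
    cores[least].append(cores[most].pop())
    return True
```

With two cores and three threads, each call to balance_once moves the extra
thread to the other core, so the pair that has to share a core changes from
one run of the balancer to the next.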

There is an observed bug where a low priority thread can get pinned to a
particular core that is already occupied by a high-priority CPU-bound
thread that never releases the CPU.  The low priority thread can't
migrate to another core that subsequently becomes available because it
is pinned.  It is not known how the thread originally got into this
state.  I don't see any reason for 4BSD to be immune to this problem.

> 4) The real time behavior of SCHED_ULE is weak due to interactive
>    processes (e.g. the X server) being put into the "time-share" class
>    and then suffering from the problems described as 1) or 2) above.
>    (You distinguish time-share and batch processes, which both are
>    allowed to consume their full quanta even if a higher priority
>    process in their class becomes runnable. I think this will not
>    give the required responsiveness e.g. for an X server.)
>    They should be considered I/O intensive, if they often don't use
>    their full quantum, without taking the significant amount of CPU
>    time they may use at times into account. (I.e. the criterion for
>    time-sharing should not be the CPU time consumed, but rather some
>    fraction of the quanta not being fully used due to voluntarily giving
>    up the CPU.) With many real-time threads it may be hard to identify
>    interactive threads, since they are non-voluntarily disrupted too
>    often - this must be considered in the sampling of voluntary vs.
>    non-voluntary context switches.

It can actually be worse than this.  There is a bug that can cause the
wnck-applet component of the MATE desktop to consume a large amount of
CPU time, and apparently it is communicating with the Xorg server, which it
drives to 100% CPU.  That makes its PRI value increase greatly so
it has a lower scheduling priority.  Even without competing CPU load,
interactive performance is hurt.  With competing CPU load it gets much
worse.



Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-09 Thread Rodney W. Grimes
> On Fri, 8 Jun 2018 17:18:43 +0300
> Andriy Gapon  wrote:
> 
> > On 08/06/2018 15:27, Gary Jennejohn wrote:
> > > On Thu, 7 Jun 2018 20:14:10 +0300
> > > Andriy Gapon  wrote:
> > >   
> > >> On 03/05/2018 12:41, Andriy Gapon wrote:  
> > >>> I think that we need preemption policies that might not be
> > >>> expressible as one or two numbers.  A policy could be something
> > >>> like this:
> > >>> - interrupt threads can preempt only threads from "lower" classes:
> > >>>   real-time, kernel, timeshare, idle;
> > >>> - interrupt threads cannot preempt other interrupt threads
> > >>> - real-time threads can preempt other real-time threads and threads
> > >>>   from "lower" classes: kernel, timeshare, idle
> > >>> - kernel threads can preempt only threads from lower classes:
> > >>>   timeshare, idle
> > >>> - interactive timeshare threads can only preempt batch and idle threads
> > >>> - batch threads can only preempt idle threads
> > >>
> > >>
> > >> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
> > >>  
> > > 
> > > What about SCHED_4BSD?  Or is this just an example and you chose
> > > SCHED_ULE for it?  
> > 
> > I haven't looked at SCHED_4BSD code at all.
> > 
> 
> I hope you will eventually because that's what I use.  I find its
> scheduling of interactive processes much better than ULE.

+1

Bruce Evans may have some info and/or changes here too.


-- 
Rod Grimes rgri...@freebsd.org


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-09 Thread Stefan Esser
Am 07.06.18 um 19:14 schrieb Andriy Gapon:
> On 03/05/2018 12:41, Andriy Gapon wrote:
>> I think that we need preemption policies that might not be expressible
>> as one or two numbers.  A policy could be something like this:
>> - interrupt threads can preempt only threads from "lower" classes:
>>   real-time, kernel, timeshare, idle;
>> - interrupt threads cannot preempt other interrupt threads
>> - real-time threads can preempt other real-time threads and threads
>>   from "lower" classes: kernel, timeshare, idle
>> - kernel threads can preempt only threads from lower classes:
>>   timeshare, idle
>> - interactive timeshare threads can only preempt batch and idle threads
>> - batch threads can only preempt idle threads
> 
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693

Hi Andriy,

I highly appreciate your effort to improve the scheduling in SCHED_ULE.

But I'm afraid that your scheme will not fix the problem. As you may
know, there are a number of problems with SCHED_ULE, which lead quite a
number of users to prefer SCHED_4BSD even on multi-core systems.

The problems I'm aware of:

1) On UP systems, I/O intensive applications may be starved by compute
   intensive processes that are allowed to consume their full quantum of
   time (limiting reads to some 10 per second worst case).

2) Similarly, on SMP systems with load higher than the number of cores
   (virtual cores in case of HT), the compute bound cores can slow down
   a cp of a large file from 100s of MB/s to 100s of KB/s, under certain
   circumstances.

3) Programs that evenly split the load on all available cores have been
   suffering from sub-optimal assignment of threads to cores. E.g. on a
   CPU with 8 (virtual) cores, this resulted in 6 cores running the load
   in nominal time, 1 core taking twice as long because 2 threads were
   scheduled to run on it, while 1 core was mostly idle. Even if the
   load was initially evenly distributed, a woken up process that ran on
   one core destroyed the symmetry and it was not recovered. (This was a
   problem e.g. for parallel programs using MPI or the like.)

4) The real time behavior of SCHED_ULE is weak due to interactive
   processes (e.g. the X server) being put into the "time-share" class
   and then suffering from the problems described as 1) or 2) above.
   (You distinguish time-share and batch processes, which both are
   allowed to consume their full quanta even if a higher priority
   process in their class becomes runnable. I think this will not
   give the required responsiveness e.g. for an X server.)
   They should be considered I/O intensive, if they often don't use
   their full quantum, without taking the significant amount of CPU
   time they may use at times into account. (I.e. the criterion for
   time-sharing should not be the CPU time consumed, but rather some
   fraction of the quanta not being fully used due to voluntarily giving
   up the CPU.) With many real-time threads it may be hard to identify
   interactive threads, since they are non-voluntarily disrupted too
   often - this must be considered in the sampling of voluntary vs.
   non-voluntary context switches.

5) The NICE parameter has hardly any effect on the scheduling. Processes
   started with nice 19 get nearly the same share of the CPU as processes
   at nice 0, while traditionally they should only run when a core would
   otherwise be idle. Nice values between 0 and 19 have even less effect
   (hardly any).
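
The criterion suggested in 4) could look roughly like the following Python
sketch.  It only illustrates the proposal and says nothing about what
SCHED_ULE actually does; the function name, the 25% cut-off and the choice to
leave preempted slices out of the sample are assumptions made for the example.

```python
def is_interactive(voluntary, full_quantum, preempted, min_fraction=0.25):
    """Classify a thread from counts of how its recent time slices ended.

    voluntary    -- slices ended early because the thread gave up the CPU
    full_quantum -- slices that ran until the quantum expired
    preempted    -- slices cut short by another thread; they say nothing
                    about the thread's own behaviour, so they are ignored
    """
    sampled = voluntary + full_quantum
    if sampled == 0:
        return False          # no evidence yet: not treated as interactive
    # CPU time consumed plays no role, only how often the quantum went unused.
    return voluntary / sampled >= min_fraction
```

A thread that gives up the CPU in 30 of 100 sampled slices counts as
interactive here, no matter how much CPU time it used in the other 70 or how
often it was preempted in between.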

I have not had time to try the patch in that review, but I think that
the cause of scheduling problems is not localized in that function.

And a solution should be based on typical use cases or sample scenarios
being applied to a scheduling policy. There are some easy cases (e.g. a
"random" load of independent processes like a parallel make run), where
only cache effects are relevant (try to keep a thread on its CPU as long
as possible and, if interrupted, continue it on that CPU if you can assume
there is still significant cached state).

There have been extensive KTR traces that showed the scheduler behavior
under specific loads, especially MPI, and there have been attempts to
fix the uneven distribution of processes for that case (but AFAIR
without much success).

Your patches may be part of the solution, with at least 3 other parts
remaining:

1) The classification of interactive and time-share should be separate.
   Interactive means that the process does not use its full quantum in
   a non-negligible fraction of cases. The X server or a DBMS server
   should not be considered compute intensive, or request rates will
   be as low as 10 per second (if the time-share quantum is in the
   order of 100 ms).

2) The scheduling should guarantee symmetric distribution of the load
   for scenarios such as parallel programs with MPI. Since OpenMP and other
   mechanisms have similar requirements, this will become more relevant
   over time.

3) The nice-ness of a process should be relevant, to 

Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-08 Thread Gary Jennejohn
On Fri, 8 Jun 2018 17:18:43 +0300
Andriy Gapon  wrote:

> On 08/06/2018 15:27, Gary Jennejohn wrote:
> > On Thu, 7 Jun 2018 20:14:10 +0300
> > Andriy Gapon  wrote:
> >   
> >> On 03/05/2018 12:41, Andriy Gapon wrote:  
> >>> I think that we need preemption policies that might not be expressible
> >>> as one or two numbers.  A policy could be something like this:
> >>> - interrupt threads can preempt only threads from "lower" classes:
> >>>   real-time, kernel, timeshare, idle;
> >>> - interrupt threads cannot preempt other interrupt threads
> >>> - real-time threads can preempt other real-time threads and threads
> >>>   from "lower" classes: kernel, timeshare, idle
> >>> - kernel threads can preempt only threads from lower classes:
> >>>   timeshare, idle
> >>> - interactive timeshare threads can only preempt batch and idle threads
> >>> - batch threads can only preempt idle threads
> >>
> >>
> >> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
> >>  
> > 
> > What about SCHED_4BSD?  Or is this just an example and you chose
> > SCHED_ULE for it?  
> 
> I haven't looked at SCHED_4BSD code at all.
> 

I hope you will eventually because that's what I use.  I find its
scheduling of interactive processes much better than ULE.

-- 
Gary Jennejohn


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-08 Thread Andriy Gapon
On 08/06/2018 15:27, Gary Jennejohn wrote:
> On Thu, 7 Jun 2018 20:14:10 +0300
> Andriy Gapon  wrote:
> 
>> On 03/05/2018 12:41, Andriy Gapon wrote:
>>> I think that we need preemption policies that might not be expressible
>>> as one or two numbers.  A policy could be something like this:
>>> - interrupt threads can preempt only threads from "lower" classes:
>>>   real-time, kernel, timeshare, idle;
>>> - interrupt threads cannot preempt other interrupt threads
>>> - real-time threads can preempt other real-time threads and threads
>>>   from "lower" classes: kernel, timeshare, idle
>>> - kernel threads can preempt only threads from lower classes:
>>>   timeshare, idle
>>> - interactive timeshare threads can only preempt batch and idle threads
>>> - batch threads can only preempt idle threads
>>
>>
>> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
>>
> 
> What about SCHED_4BSD?  Or is this just an example and you chose
> SCHED_ULE for it?

I haven't looked at SCHED_4BSD code at all.

-- 
Andriy Gapon


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-08 Thread Gary Jennejohn
On Thu, 7 Jun 2018 20:14:10 +0300
Andriy Gapon  wrote:

> On 03/05/2018 12:41, Andriy Gapon wrote:
> > I think that we need preemption policies that might not be expressible
> > as one or two numbers.  A policy could be something like this:
> > - interrupt threads can preempt only threads from "lower" classes:
> >   real-time, kernel, timeshare, idle;
> > - interrupt threads cannot preempt other interrupt threads
> > - real-time threads can preempt other real-time threads and threads
> >   from "lower" classes: kernel, timeshare, idle
> > - kernel threads can preempt only threads from lower classes:
> >   timeshare, idle
> > - interactive timeshare threads can only preempt batch and idle threads
> > - batch threads can only preempt idle threads
> 
> 
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
> 

What about SCHED_4BSD?  Or is this just an example and you chose
SCHED_ULE for it?

-- 
Gary Jennejohn


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-06-07 Thread Andriy Gapon
On 03/05/2018 12:41, Andriy Gapon wrote:
> I think that we need preemption policies that might not be expressible
> as one or two numbers.  A policy could be something like this:
> - interrupt threads can preempt only threads from "lower" classes:
>   real-time, kernel, timeshare, idle;
> - interrupt threads cannot preempt other interrupt threads
> - real-time threads can preempt other real-time threads and threads
>   from "lower" classes: kernel, timeshare, idle
> - kernel threads can preempt only threads from lower classes:
>   timeshare, idle
> - interactive timeshare threads can only preempt batch and idle threads
> - batch threads can only preempt idle threads


Here is a sketch of the idea: https://reviews.freebsd.org/D15693
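
Separately from the patch in that review, the rules quoted above can be
written down as a small table.  The Python below only restates the list for
illustration; the class names and the may_preempt helper are invented here and
do not come from D15693 or from the kernel.  The list says nothing about idle
threads preempting anything, so the empty entry for "idle" is an assumption.

```python
# Which classes a newly runnable thread of a given class may preempt.
# "interactive" and "batch" are the two kinds of timeshare thread.
# Hypothetical names; a restatement of the quoted list, not kernel code.
PREEMPTABLE = {
    "interrupt":   {"realtime", "kernel", "interactive", "batch", "idle"},
    "realtime":    {"realtime", "kernel", "interactive", "batch", "idle"},
    "kernel":      {"interactive", "batch", "idle"},
    "interactive": {"batch", "idle"},
    "batch":       {"idle"},
    "idle":        set(),       # assumption: not covered by the list
}

def may_preempt(new_class, running_class):
    """True if a thread of new_class may preempt one of running_class."""
    return running_class in PREEMPTABLE[new_class]
```

Note that a table like this has one row per class, which is the point of the
proposal: it cannot be collapsed into a single threshold number.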

-- 
Andriy Gapon


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-04-05 Thread Stefan Esser
Am 04.04.18 um 18:45 schrieb Andriy Gapon:
> On 04/04/2018 16:19, Stefan Esser wrote:
>> I have identified the cause of the extremely low I/O performance (2 to 6
>> read operations scheduled per second).
>>
>> The default value of kern.sched.preempt_thresh=0 does not give any CPU to
>> the I/O bound process unless a (long) time slice expires
>> (kern.sched.quantum=94488 on my system with HZ=1000) or one of the CPU
>> bound processes voluntarily gives up the CPU (or exits).
>>
>> Any non-zero value of preempt_thresh lets the system perform I/O in parallel
>> with the CPU bound processes, again.
> 
> Let me guess... you have a custom kernel configuration and, unlike GENERIC
> (assuming x86), it does not have 'options PREEMPTION'?

Yes, thank you for pointing that out!!!

I used to have PREEMPTION and FULL_PREEMPTION in my kernel configuration,
and apparently have deleted both options when only FULL_PREEMPTION was
supposed to go ...


After looking at sched_ule.c and top/machine.c it appears that the value
of preempt_thresh corresponds to the PRI value as shown by top (or ps -l)
plus PZERO which is calculated as (PRI_MIN_KERN=80) + 20.

What I do not understand, though, is that the decision about a preemption
is only based on the calculated new priority of the thread, but not at all
on the priority of other running threads (except the idle thread).
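
To make the behaviour described in the paragraph above concrete, here is a
small Python model of the decision.  It is a paraphrase for illustration, not
the code from sched_ule.c: the function name and arguments are made up, and
PRI_MIN_IDLE = 224 is taken from memory of sys/priority.h, so treat that value
as an assumption.

```python
PRI_MIN_IDLE = 224    # assumed value; a lower number means a higher priority

def should_preempt(new_pri, running_pri, preempt_thresh):
    """Model of the decision as described above (not the kernel source)."""
    if running_pri >= PRI_MIN_IDLE:
        return True                   # the idle thread is always preempted
    if preempt_thresh == 0:
        return False                  # threshold 0: nothing else is preempted
    # Otherwise only the new thread's own priority is compared with the
    # threshold; the priority of the running thread is not consulted.
    return new_pri <= preempt_thresh
```

In this model a thread at priority 120 never displaces a running thread at
priority 200 while the threshold is 0 or 80, but does so once the threshold
is 124.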

On my system, a "real" batch job (i.e. one that does not voluntarily give
up the CPU due to I/O) seems to have a PRI value of 80 to 100 (growing
over time), while an interactive process has a PRI of 20, a maximally
"niced" interactive process has 52.

So, I'd expect a reasonable default value of preempt_thresh to be slightly
above 120 (e.g. 124) to prevent I/O heavy threads from stealing the CPU
from each other too often, and to prevent "niced" processes from doing the
same ...
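
The arithmetic behind that estimate, as a short Python sketch.  PRI_MIN_KERN
and the offset of 20 are the values given above; the helper name is invented
for the example.

```python
PRI_MIN_KERN = 80
PZERO = PRI_MIN_KERN + 20     # 100, the offset between top's PRI and the
                              # scale used by preempt_thresh

def thresh_for_top_pri(top_pri):
    """Value on the preempt_thresh scale for a PRI shown by top or ps -l."""
    return top_pri + PZERO

for label, pri in [("interactive", 20), ("niced interactive", 52),
                   ("batch, low end", 80), ("batch, high end", 100)]:
    print(f"{label:18} PRI {pri:3} -> {thresh_for_top_pri(pri)}")
```

The four cases map to 120, 152, 180 and 200, so a threshold of 124 sits just
above the interactive value and below both the niced and the batch values.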

The two values configured into the kernel (80 for PREEMPTION and 255 for
FULL_PREEMPTION) seem to be extremes, but something in between (e.g. 124)
is not offered.  It can only be configured via sysctl, and besides the
kernel sources I have found no document that explains how the threshold
value corresponds to the PRI value.


Is PRI_MIN_KERN=80 really a good default value for the preemption threshold?

Regards, Stefan


Re: Is kern.sched.preempt_thresh=0 a sensible default?

2018-04-04 Thread Andriy Gapon
On 04/04/2018 16:19, Stefan Esser wrote:
> I have identified the cause of the extremely low I/O performance (2 to 6 read
> operations scheduled per second).
> 
> The default value of kern.sched.preempt_thresh=0 does not give any CPU to the
> I/O bound process unless a (long) time slice expires (kern.sched.quantum=94488
> on my system with HZ=1000) or one of the CPU bound processes voluntarily gives
> up the CPU (or exits).
> 
> Any non-zero value of preempt_thresh lets the system perform I/O in parallel
> with the CPU bound processes, again.

Let me guess... you have a custom kernel configuration and, unlike GENERIC
(assuming x86), it does not have 'options PREEMPTION'?
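
As a side note on the numbers quoted above: a quantum of 94488 (microseconds,
going by the description of the sysctl) bounds how often a thread that only
gets the CPU when a time slice expires can issue a read.  The Python below
just does that arithmetic.  That the I/O bound process issues one read per
turn and then waits behind k CPU bound threads is an assumption made only for
illustration.

```python
QUANTUM_US = 94488            # kern.sched.quantum from the quoted message

def reads_per_second(cpu_bound_ahead, quantum_us=QUANTUM_US):
    """Upper bound on reads/s if each read waits for this many full quanta."""
    return 1_000_000 / (cpu_bound_ahead * quantum_us)

for k in (1, 2, 5):
    print(k, round(reads_per_second(k), 1))
```

One competing thread gives about 10.6 reads per second, two about 5.3 and
five about 2.1, which is the same order of magnitude as the 2 to 6 reported.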

-- 
Andriy Gapon