Re: Is kern.sched.preempt_thresh=0 a sensible default?
On Sat, Jun 09, 2018 at 06:07:15PM -0700, Don Lewis wrote:
> On 9 Jun, Stefan Esser wrote:
>
> > 3) Programs that evenly split the load on all available cores have been
> >    suffering from sub-optimal assignment of threads to cores. E.g. on a
> >    CPU with 8 (virtual) cores, this resulted in 6 cores running the load
> >    in nominal time, 1 core taking twice as long because 2 threads were
> >    scheduled to run on it, while 1 core was mostly idle. Even if the
> >    load was initially evenly distributed, a woken up process that ran on
> >    one core destroyed the symmetry and it was not recovered. (This was a
> >    problem e.g. for parallel programs using MPI or the like.)
>
> When a core is about to go idle or first enters the idle state it will
> search for the most heavily loaded core and steal a thread from it. The
> core will only go to sleep if it can't find a non-running thread to
> steal.
>
> If there are N cores and N+1 runnable threads, there is a long term load
> balancer that runs periodically. It searches for the most and least
> loaded cores and moves a thread from the former to the latter. That
> prevents the same pair of threads from having to share the same core
> indefinitely.
>
> There is an observed bug where a low priority thread can get pinned to a
> particular core that is already occupied by a high-priority CPU-bound
> thread that never releases the CPU. The low priority thread can't
> migrate to another core that subsequently becomes available because it
> is pinned. It is not known how the thread originally got into this
> state. I don't see any reason for 4BSD to be immune to this problem.

It is a well-known problem that an over-subscribed ULE kernel has
much worse performance than a 4BSD kernel. I've posted more than
once with benchmark numbers that demonstrate the problem.

--
Steve
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
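As a rough illustration of the two mechanisms Don Lewis describes in the quoted text (idle-time stealing and the periodic long-term balancer), here is a toy Python model. It is not FreeBSD code: the run queues are plain lists, and the names `idle_steal` and `balance_once` are made up for this sketch.

```python
# Toy model only; not taken from sched_ule.c.
# Each core is a list of runnable thread names. By assumption, index 0 is
# the thread currently running and the rest are waiting on the run queue.

def idle_steal(cores, idle):
    """An idle core takes one waiting thread from the most loaded core.

    Returns False when there is nothing to steal, i.e. the idle core
    would be allowed to go to sleep.
    """
    busiest = max(range(len(cores)), key=lambda c: len(cores[c]))
    if busiest == idle or len(cores[busiest]) < 2:
        return False
    # Take a queued thread from the tail, never the running one at index 0.
    cores[idle].append(cores[busiest].pop())
    return True


def balance_once(cores):
    """Periodic balancer: move one thread from the most to the least loaded core.

    With N cores and N+1 threads the loads always differ by one, so each
    call moves the doubled-up slot to a different core.
    """
    most = max(range(len(cores)), key=lambda c: len(cores[c]))
    least = min(range(len(cores)), key=lambda c: len(cores[c]))
    if len(cores[most]) <= len(cores[least]):
        return False  # already balanced
    cores[least].append(cores[most].pop())
    return True
```

In this model, three cores holding `[["t0", "t4"], ["t1"], ["t2"]]` become `[["t0"], ["t1", "t4"], ["t2"]]` after one balancer pass, so a given pair does not share a core for more than one balancing period. The pinned-thread bug mentioned in the quote is deliberately not modeled.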
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On 9 Jun, Stefan Esser wrote:

> 3) Programs that evenly split the load on all available cores have been
>    suffering from sub-optimal assignment of threads to cores. E.g. on a
>    CPU with 8 (virtual) cores, this resulted in 6 cores running the load
>    in nominal time, 1 core taking twice as long because 2 threads were
>    scheduled to run on it, while 1 core was mostly idle. Even if the
>    load was initially evenly distributed, a woken up process that ran on
>    one core destroyed the symmetry and it was not recovered. (This was a
>    problem e.g. for parallel programs using MPI or the like.)

When a core is about to go idle or first enters the idle state it will
search for the most heavily loaded core and steal a thread from it. The
core will only go to sleep if it can't find a non-running thread to
steal.

If there are N cores and N+1 runnable threads, there is a long term load
balancer that runs periodically. It searches for the most and least
loaded cores and moves a thread from the former to the latter. That
prevents the same pair of threads from having to share the same core
indefinitely.

There is an observed bug where a low priority thread can get pinned to a
particular core that is already occupied by a high-priority CPU-bound
thread that never releases the CPU. The low priority thread can't
migrate to another core that subsequently becomes available because it
is pinned. It is not known how the thread originally got into this
state. I don't see any reason for 4BSD to be immune to this problem.

> 4) The real time behavior of SCHED_ULE is weak due to interactive
>    processes (e.g. the X server) being put into the "time-share" class
>    and then suffering from the problems described as 1) or 2) above.
>    (You distinguish time-share and batch processes, which both are
>    allowed to consume their full quanta even if a higher priority
>    process in their class becomes runnable. I think this will not
>    give the required responsiveness e.g. for an X server.)
>    They should be considered I/O intensive, if they often don't use
>    their full quantum, without taking the significant amount of CPU
>    time they may use at times into account. (I.e. the criterion for
>    time-sharing should not be the CPU time consumed, but rather some
>    fraction of the quanta not being fully used due to voluntarily giving
>    up the CPU.) With many real-time threads it may be hard to identify
>    interactive threads, since they are non-voluntarily disrupted too
>    often - this must be considered in the sampling of voluntary vs.
>    non-voluntary context switches.

It can actually be worse than this. There is a bug that can cause the
wnck-applet component of the MATE desktop to consume a large amount of
CPU time, and apparently it is communicating with the Xorg server, which
it drives to 100% CPU. That makes its PRI value increase greatly so it
has a lower scheduling priority. Even without competing CPU load,
interactive performance is hurt. With competing CPU load it gets much
worse.
Re: Is kern.sched.preempt_thresh=0 a sensible default?
> On Fri, 8 Jun 2018 17:18:43 +0300
> Andriy Gapon wrote:
>
> > On 08/06/2018 15:27, Gary Jennejohn wrote:
> > > On Thu, 7 Jun 2018 20:14:10 +0300
> > > Andriy Gapon wrote:
> > >
> > >> On 03/05/2018 12:41, Andriy Gapon wrote:
> > >>> I think that we need preemption policies that might not be
> > >>> expressible as one or two numbers. A policy could be something
> > >>> like this:
> > >>> - interrupt threads can preempt only threads from "lower" classes:
> > >>>   real-time, kernel, timeshare, idle;
> > >>> - interrupt threads cannot preempt other interrupt threads
> > >>> - real-time threads can preempt other real-time threads and
> > >>>   threads from "lower" classes: kernel, timeshare, idle
> > >>> - kernel threads can preempt only threads from lower classes:
> > >>>   timeshare, idle
> > >>> - interactive timeshare threads can only preempt batch and idle
> > >>>   threads
> > >>> - batch threads can only preempt idle threads
> > >>
> > >> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
> > >
> > > What about SCHED_4BSD? Or is this just an example and you chose
> > > SCHED_ULE for it?
> >
> > I haven't looked at SCHED_4BSD code at all.
>
> I hope you will eventually because that's what I use. I find its
> scheduling of interactive processes much better than ULE.

+1

Bruce Evans may have some info and/or changes here too.

--
Rod Grimes rgri...@freebsd.org
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On 07.06.18 at 19:14, Andriy Gapon wrote:
> On 03/05/2018 12:41, Andriy Gapon wrote:
>> I think that we need preemption policies that might not be expressible
>> as one or two numbers. A policy could be something like this:
>> - interrupt threads can preempt only threads from "lower" classes:
>>   real-time, kernel, timeshare, idle;
>> - interrupt threads cannot preempt other interrupt threads
>> - real-time threads can preempt other real-time threads and threads
>>   from "lower" classes: kernel, timeshare, idle
>> - kernel threads can preempt only threads from lower classes:
>>   timeshare, idle
>> - interactive timeshare threads can only preempt batch and idle threads
>> - batch threads can only preempt idle threads
>
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693

Hi Andriy,

I highly appreciate your effort to improve the scheduling in SCHED_ULE.

But I'm afraid that your scheme will not fix the problem. As you may
know, there are a number of problems with SCHED_ULE, which lead quite a
number of users to prefer SCHED_4BSD even on multi-core systems.

The problems I'm aware of:

1) On UP systems, I/O intensive applications may be starved by compute
   intensive processes that are allowed to consume their full quantum
   of time (limiting reads to some 10 per second worst case).

2) Similarly, on SMP systems with load higher than the number of cores
   (virtual cores in case of HT), the compute bound cores can slow down
   a cp of a large file from 100s of MB/s to 100s of KB/s, under certain
   circumstances.

3) Programs that evenly split the load on all available cores have been
   suffering from sub-optimal assignment of threads to cores. E.g. on a
   CPU with 8 (virtual) cores, this resulted in 6 cores running the load
   in nominal time, 1 core taking twice as long because 2 threads were
   scheduled to run on it, while 1 core was mostly idle. Even if the
   load was initially evenly distributed, a woken up process that ran on
   one core destroyed the symmetry and it was not recovered. (This was a
   problem e.g. for parallel programs using MPI or the like.)

4) The real time behavior of SCHED_ULE is weak due to interactive
   processes (e.g. the X server) being put into the "time-share" class
   and then suffering from the problems described as 1) or 2) above.
   (You distinguish time-share and batch processes, which both are
   allowed to consume their full quanta even if a higher priority
   process in their class becomes runnable. I think this will not
   give the required responsiveness e.g. for an X server.)
   They should be considered I/O intensive, if they often don't use
   their full quantum, without taking the significant amount of CPU
   time they may use at times into account. (I.e. the criterion for
   time-sharing should not be the CPU time consumed, but rather some
   fraction of the quanta not being fully used due to voluntarily giving
   up the CPU.) With many real-time threads it may be hard to identify
   interactive threads, since they are non-voluntarily disrupted too
   often - this must be considered in the sampling of voluntary vs.
   non-voluntary context switches.

5) The NICE parameter has hardly any effect on the scheduling. Processes
   started with nice 19 get nearly the same share of the CPU as processes
   at nice 0, while traditionally they should only run when a core would
   otherwise be idle. Nice values between 0 and 19 have even less effect
   (hardly any).

I have not had time to try the patch in that review, but I think that
the cause of scheduling problems is not localized in that function.

And a solution should be based on typical use cases or sample scenarios
being applied to a scheduling policy. There are some easy cases (e.g.
a "random" load of independent processes like a parallel make run),
where only cache effects are relevant (try to keep a thread on its CPU
as long as possible and, if interrupted, continue it on that CPU if you
can assume there is still significant cached state).

There have been extensive KTR traces that showed the scheduler behavior
under specific loads, especially MPI, and there have been attempts to
fix the uneven distribution of processes for that case (but AFAIR not
with good success).

Your patches may be part of the solution, with at least 3 other parts
remaining:

1) The classification of interactive and time-share should be separate.
   Interactive means that the process does not use its full quantum in
   a non-negligible fraction of cases. The X server or a DBMS server
   should not be considered compute intensive, or request rates will be
   as low as 10 per second (if the time-share quantum is in the order
   of 100 ms).

2) The scheduling should guarantee symmetric distribution of the load
   for scenarios such as parallel programs with MPI. Since OpenMP and
   other mechanisms have similar requirements, this will become more
   relevant over time.

3) The nice-ness of a process should be relevant, to
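The classification criterion proposed in this message (judge interactivity by how often a thread gives up the CPU voluntarily, not by CPU time consumed) could be sketched roughly as follows. This is only an illustration of the idea, not how SCHED_ULE computes its interactivity score; the function name, the three counters, and the 25% cut-off are all assumptions made for the example.

```python
def looks_interactive(voluntary, expired, preempted=0, min_fraction=0.25):
    """Classify a thread from per-slice counters (all hypothetical).

    voluntary -- slices ended early because the thread blocked or yielded
    expired   -- slices the thread used up completely
    preempted -- slices cut short by a higher-priority thread; these say
                 nothing about the thread's own behavior, so they are
                 ignored, as the message suggests for systems with many
                 real-time threads
    """
    decided = voluntary + expired
    if decided == 0:
        return False  # no evidence either way; treat as not interactive
    return voluntary / decided >= min_fraction
```

With these assumptions, a thread that yields early in 30 of 100 slices counts as interactive even if it burns a lot of CPU in the other 70, and a large number of involuntary preemptions does not change the verdict.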
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On Fri, 8 Jun 2018 17:18:43 +0300
Andriy Gapon wrote:

> On 08/06/2018 15:27, Gary Jennejohn wrote:
> > On Thu, 7 Jun 2018 20:14:10 +0300
> > Andriy Gapon wrote:
> >
> >> On 03/05/2018 12:41, Andriy Gapon wrote:
> >>> I think that we need preemption policies that might not be
> >>> expressible as one or two numbers. A policy could be something
> >>> like this:
> >>> - interrupt threads can preempt only threads from "lower" classes:
> >>>   real-time, kernel, timeshare, idle;
> >>> - interrupt threads cannot preempt other interrupt threads
> >>> - real-time threads can preempt other real-time threads and threads
> >>>   from "lower" classes: kernel, timeshare, idle
> >>> - kernel threads can preempt only threads from lower classes:
> >>>   timeshare, idle
> >>> - interactive timeshare threads can only preempt batch and idle
> >>>   threads
> >>> - batch threads can only preempt idle threads
> >>
> >> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
> >
> > What about SCHED_4BSD? Or is this just an example and you chose
> > SCHED_ULE for it?
>
> I haven't looked at SCHED_4BSD code at all.

I hope you will eventually because that's what I use. I find its
scheduling of interactive processes much better than ULE.

--
Gary Jennejohn
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On 08/06/2018 15:27, Gary Jennejohn wrote:
> On Thu, 7 Jun 2018 20:14:10 +0300
> Andriy Gapon wrote:
>
>> On 03/05/2018 12:41, Andriy Gapon wrote:
>>> I think that we need preemption policies that might not be
>>> expressible as one or two numbers. A policy could be something
>>> like this:
>>> - interrupt threads can preempt only threads from "lower" classes:
>>>   real-time, kernel, timeshare, idle;
>>> - interrupt threads cannot preempt other interrupt threads
>>> - real-time threads can preempt other real-time threads and threads
>>>   from "lower" classes: kernel, timeshare, idle
>>> - kernel threads can preempt only threads from lower classes:
>>>   timeshare, idle
>>> - interactive timeshare threads can only preempt batch and idle
>>>   threads
>>> - batch threads can only preempt idle threads
>>
>> Here is a sketch of the idea: https://reviews.freebsd.org/D15693
>
> What about SCHED_4BSD? Or is this just an example and you chose
> SCHED_ULE for it?

I haven't looked at SCHED_4BSD code at all.

--
Andriy Gapon
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On Thu, 7 Jun 2018 20:14:10 +0300
Andriy Gapon wrote:

> On 03/05/2018 12:41, Andriy Gapon wrote:
> > I think that we need preemption policies that might not be
> > expressible as one or two numbers. A policy could be something
> > like this:
> > - interrupt threads can preempt only threads from "lower" classes:
> >   real-time, kernel, timeshare, idle;
> > - interrupt threads cannot preempt other interrupt threads
> > - real-time threads can preempt other real-time threads and threads
> >   from "lower" classes: kernel, timeshare, idle
> > - kernel threads can preempt only threads from lower classes:
> >   timeshare, idle
> > - interactive timeshare threads can only preempt batch and idle
> >   threads
> > - batch threads can only preempt idle threads
>
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693

What about SCHED_4BSD? Or is this just an example and you chose
SCHED_ULE for it?

--
Gary Jennejohn
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On 03/05/2018 12:41, Andriy Gapon wrote:
> I think that we need preemption policies that might not be expressible
> as one or two numbers. A policy could be something like this:
> - interrupt threads can preempt only threads from "lower" classes:
>   real-time, kernel, timeshare, idle;
> - interrupt threads cannot preempt other interrupt threads
> - real-time threads can preempt other real-time threads and threads
>   from "lower" classes: kernel, timeshare, idle
> - kernel threads can preempt only threads from lower classes:
>   timeshare, idle
> - interactive timeshare threads can only preempt batch and idle threads
> - batch threads can only preempt idle threads

Here is a sketch of the idea: https://reviews.freebsd.org/D15693

--
Andriy Gapon
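The class-based policy listed in the quoted proposal can be written down as a small lookup table. The Python below is only a restatement of that list for illustration; it is not the code in review D15693. The class names, the split of "timeshare" into "interactive" and "batch", and the empty entry for idle threads (which the list does not mention) are assumptions of this sketch.

```python
# Which classes a newly runnable thread of a given class may preempt.
MAY_PREEMPT = {
    "interrupt":   {"real-time", "kernel", "interactive", "batch", "idle"},
    "real-time":   {"real-time", "kernel", "interactive", "batch", "idle"},
    "kernel":      {"interactive", "batch", "idle"},
    "interactive": {"batch", "idle"},
    "batch":       {"idle"},
    "idle":        set(),  # assumption: not stated in the proposal
}


def may_preempt(new_class, running_class):
    """Return True if the policy lets new_class preempt running_class.

    Only the class-level rule is modeled; priority comparisons within
    a class (e.g. between two real-time threads) are left out.
    """
    return running_class in MAY_PREEMPT[new_class]
```

For example, `may_preempt("interrupt", "interrupt")` is False while `may_preempt("real-time", "real-time")` is True, which is the one asymmetry in the list. A table like this is one way to express a policy that a single numeric threshold cannot.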
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On 04.04.18 at 18:45, Andriy Gapon wrote:
> On 04/04/2018 16:19, Stefan Esser wrote:
>> I have identified the cause of the extremely low I/O performance (2 to 6
>> read operations scheduled per second).
>>
>> The default value of kern.sched.preempt_thresh=0 does not give any CPU to
>> the I/O bound process unless a (long) time slice expires
>> (kern.sched.quantum=94488 on my system with HZ=1000) or one of the CPU
>> bound processes voluntarily gives up the CPU (or exits).
>>
>> Any non-zero value of preempt_thresh lets the system perform I/O in
>> parallel with the CPU bound processes, again.
>
> Let me guess... you have a custom kernel configuration and, unlike GENERIC
> (assuming x86), it does not have 'options PREEMPTION'?

Yes, thank you for pointing that out!!!

I used to have PREEMPTION and FULL_PREEMPTION in my kernel configuration,
and apparently have deleted both options when only FULL_PREEMPTION was
supposed to go ...

After looking at sched_ule.c and top/machine.c it appears that the value
of preempt_thresh corresponds to the PRI value as shown by top (or ps -l)
plus PZERO, which is calculated as (PRI_MIN_KERN=80) + 20.

What I do not understand, though, is that the decision about a preemption
is only based on the calculated new priority of the thread, but not at all
on the priority of other running threads (except the idle thread).

On my system, a "real" batch job (i.e. one that does not voluntarily give
up the CPU due to I/O) seems to have a PRI value of 80 to 100 (growing
over time), while an interactive process has a PRI of 20, a maximally
"niced" interactive process has 52.

So, I'd expect a reasonable default value of preempt_thresh to be slightly
above 120 (e.g. 124) to prevent I/O heavy threads from stealing the CPU
from each other too often, and to prevent "niced" processes from doing the
same ...

The two values configured into the kernel (80 for PREEMPTION and 255 for
FULL_PREEMPTION) seem to be extremes, but something in between (e.g. 124)
is not offered. It can only be configured via sysctl, and I have found no
documentation of the correspondence between the threshold value and the
PRI value besides the kernel sources ...

Is PRI_MIN_KERN=80 really a good default value for the preemption
threshold?

Regards, STefan
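To make the arithmetic in this message concrete, here is a small Python sketch of the stated mapping between the PRI column of top and the preempt_thresh scale. The offset (PRI_MIN_KERN + 20 = 100) and the sample PRI values are taken from the message. The direction of the comparison (a thread may preempt when its kernel priority is numerically at or below the threshold, and never when the threshold is 0) is an assumption based on the behavior described in the thread, and the function names are made up. The real decision in sched_ule.c involves further checks that are not modeled here.

```python
PRI_MIN_KERN = 80
PZERO = PRI_MIN_KERN + 20  # 100, as stated in the message


def kernel_priority(top_pri):
    """Convert a PRI value as shown by top or ps -l to the kernel scale."""
    return top_pri + PZERO


def may_preempt_by_threshold(top_pri, preempt_thresh):
    """Simplified threshold test (assumed semantics, see lead-in)."""
    if preempt_thresh == 0:
        return False  # threshold 0: no preemption by this rule
    return kernel_priority(top_pri) <= preempt_thresh
```

Under these assumptions a threshold of 124 lets an interactive thread (PRI 20, kernel priority 120) preempt, but not a maximally niced one (PRI 52, 152) or a batch job (PRI 80 to 100, 180 to 200). With the PREEMPTION value of 80 none of the three preempts, and with the FULL_PREEMPTION value of 255 all of them do.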
Re: Is kern.sched.preempt_thresh=0 a sensible default?
On 04/04/2018 16:19, Stefan Esser wrote:
> I have identified the cause of the extremely low I/O performance (2 to 6
> read operations scheduled per second).
>
> The default value of kern.sched.preempt_thresh=0 does not give any CPU to
> the I/O bound process unless a (long) time slice expires
> (kern.sched.quantum=94488 on my system with HZ=1000) or one of the CPU
> bound processes voluntarily gives up the CPU (or exits).
>
> Any non-zero value of preempt_thresh lets the system perform I/O in
> parallel with the CPU bound processes, again.

Let me guess... you have a custom kernel configuration and, unlike GENERIC
(assuming x86), it does not have 'options PREEMPTION'?

--
Andriy Gapon
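The figures in the quoted report fit a simple back-of-the-envelope model. The Python sketch below assumes that kern.sched.quantum is expressed in microseconds, that without preemption the I/O bound process gets the CPU once per round in which every CPU bound competitor on that core uses a full quantum, and that its own run time is negligible. The function name is hypothetical.

```python
def max_io_wakeups_per_second(quantum_us, cpu_bound_competitors=1):
    """Upper bound on how often a non-preempting I/O thread gets to run.

    quantum_us            -- time slice length in microseconds (assumed unit)
    cpu_bound_competitors -- CPU bound threads sharing the same core, each
                             assumed to consume its full quantum
    """
    return 1_000_000 / (quantum_us * cpu_bound_competitors)
```

With the reported quantum of 94488 this gives about 10.6 wakeups per second against one competitor, about 5.3 against two, and about 2.1 against five, which matches the observed 2 to 6 read operations per second.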