Re: ule+smp: small optimization for turnstile priority lending
On 07.11.2012 08:45, David Xu wrote:
On 2012/11/07 14:17, Jeff Roberson wrote:
On Wed, 7 Nov 2012, David Xu wrote:
On 2012/11/06 19:03, Attilio Rao wrote:
On 9/20/12, David Xu davi...@freebsd.org wrote:

I found another scenario in taskqueue: in the function taskqueue_terminate, the current thread tries to wake another thread up and then sleeps immediately, and the tq_mutex is sometimes a spinlock. So if you remove the current thread's load from the current CPU before the wakeup, the resumed thread may be put on the same CPU, so it will improve the CPU scheduling too.

I think that, in order to fit with the sched_add() modifications I have in mind (see other patches within this thread), wakeup() should grow a new argument, or maybe we can use a new wakeup_flags() KPI. If the latter is the case, I would also propose letting wakeup_one() be absorbed into wakeup_flags() with its own flag.

Yes, I like the idea.

From ~2007: http://people.freebsd.org/~jeff/wakeupflags.diff
This has some different optimizations that can be done when you have a hinted wakeup.

wakeup_flags is a good starting point. But what flags should be supported? I think WAKEUP_WILLSLEEP may be needed if the current thread will switch out as soon as possible.

WAKEUP_ONE is important for accept sockets. We don't want all idle threads to compete and contend on a new connection. -- Andre

I see you have added WAKEUP_LOCAL and WAKEUP_TAIL. Are they for cache optimization? Both of them are arguable. WAKEUP_LOCAL increases CPU migration, which is not good: one benefit of ULE is keeping a thread on its old CPU, and WAKEUP_LOCAL violates that design. WAKEUP_LOCAL as used in the pipe code may not be correct if the pipe only has a few bytes to be transferred, or if the receiver only eats a few bytes each time, so why bother moving the other thread to the local CPU? If the other thread has many bytes cached on its old CPU, this migration is expensive. WAKEUP_TAIL is a good idea; I guess you want to wake up a hot thread whose code and data are still in its old CPU's cache. But this will lead to unfairness. I'd like the sleep queue to implement a policy like the one I did in libthr: it picks a thread at the tail of the queue a fixed percentage of the time. It does have some level of unfairness, but nothing fatal.

Thanks,
Jeff

___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
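The flag semantics discussed in this subthread can be sketched in userland C. This is purely a toy model: the flag names come from the thread (and Jeff's wakeupflags.diff), but the queue representation and the wakeup_flags() signature here are illustrative assumptions, not the kernel sleepqueue implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define WAKEUP_ONE       0x01  /* wake only one waiter (e.g. accept sockets) */
#define WAKEUP_WILLSLEEP 0x02  /* caller will switch out as soon as possible */
#define WAKEUP_TAIL      0x04  /* prefer the most recently queued (cache-hot) waiter */

struct waiter {
    int tid;
    struct waiter *next;
};

/* Toy wait channel: a singly linked list, head = longest sleeper. */
static struct waiter *wchan;

/* Returns the number of waiters made runnable. */
static int
wakeup_flags(int flags)
{
    int nwoken = 0;

    if (flags & WAKEUP_ONE) {
        /* Wake a single waiter to avoid a thundering herd. */
        if (wchan != NULL) {
            wchan = wchan->next;
            nwoken = 1;
        }
    } else {
        /* Classic wakeup(): make every waiter runnable. */
        while (wchan != NULL) {
            wchan = wchan->next;
            nwoken++;
        }
    }
    /*
     * WAKEUP_WILLSLEEP and WAKEUP_TAIL would hint the scheduler about
     * CPU selection and pick order; a userland toy can only record them.
     */
    return (nwoken);
}
```

The WAKEUP_ONE path is what the accept-socket argument above is about: only one of the sleeping threads should compete for the new connection.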
Re: ule+smp: small optimization for turnstile priority lending
On Wed, 7 Nov 2012, David Xu wrote:
On 2012/11/06 19:03, Attilio Rao wrote:
On 9/20/12, David Xu davi...@freebsd.org wrote:
On 2012/09/18 22:05, Andriy Gapon wrote:

Here is a snippet that demonstrates the issue on a supposedly fully loaded 2-processor system:

136794 0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818, state:running, attributes: prio:122
136793 0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916, state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)
136792 1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14, state:running, attributes: prio:255
136791 1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0, attributes: none
136790 1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473, state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx
136789 1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2, attributes: none
136788 1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473, point:wokeup, attributes: linkedto:Xorg tid 102818
136787 1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq add, attributes: prio:122, linkedto:firefox tid 113473
136786 1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1, attributes: none
136785 1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq rem, attributes: prio:176
136784 1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818, point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473
136783 1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473, state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with zero load.
Here is a proposed solution:

turnstile_wait: optimize priority lending to a thread on a runqueue

As the current thread is definitely going into mi_switch, it now removes its load before doing priority propagation, which can potentially result in sched_add. In the SMP ULE case the latter searches for the least loaded CPU to place a boosted thread, which is supposedly about to run.

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 8e466cd..3299cae 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread *newtd, int flags)
 		/* This thread must be going to sleep. */
 		TDQ_LOCK(tdq);
 		mtx = thread_lock_block(td);
-		tdq_load_rem(tdq, td);
+#if defined(SMP)
+		if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
+#endif
+			tdq_load_rem(tdq, td);
 	}
 	/*
 	 * We enter here with the thread blocked and assigned to the
@@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
 		tdq_setlowpri(tdq, NULL);
 }
 
+void
+sched_load_rem(struct thread *td)
+{
+	struct tdq *tdq;
+
+	KASSERT(td == curthread,
+	    ("sched_load_rem: only curthread is supported"));
+	KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
+	    ("thread running on cpu different from ts_cpu"));
+	tdq = TDQ_CPU(td->td_sched->ts_cpu);
+	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+	MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
+	tdq_load_rem(tdq, td);
+}
+
 /*
  * Fetch cpu utilization information.  Updates on demand.
  */
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 31d16fe..d1d68e9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue)
 		LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
 	}
 	thread_lock(td);
+#if defined(SCHED_ULE) && defined(SMP)
+	/*
+	 * Remove load earlier so that it does not affect cpu selection
+	 * for a thread woken up due to priority lending, if any.
+	 */
+	sched_load_rem(td);
+#endif
 	thread_lock_set(td, &ts->ts_lock);
 	td->td_turnstile = NULL;
diff --git a/sys/sys/sched.h b/sys/sys/sched.h
index 4b8387c..b1ead1b 100644
--- a/sys/sys/sched.h
+++ b/sys/sys/sched.h
@@ -110,6 +110,9 @@ void	sched_preempt(struct thread *td);
 void	sched_add(struct thread *td, int flags);
 void	sched_clock(struct thread *td);
 void	sched_rem(struct thread *td);
+#if defined(SCHED_ULE) && defined(SMP)
+void	sched_load_rem(struct thread *td);
+#endif
 void	sched_tick(int cnt);
 void	sched_relinquish(struct thread *td);
 struct thread *sched_choose(void);

I found another scenario in taskqueue: in the function taskqueue_terminate, the current thread tries to wake another thread up and then sleeps immediately, and the tq_mutex is sometimes a spinlock. So if you remove the current thread's load from the current CPU before the wakeup, the resumed thread may be put on the same CPU, so it will improve the CPU scheduling too.
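A toy model of why the patch removes the lender's load first: sched_add() picks the least loaded CPU, so while the about-to-sleep thread is still counted in its CPU's load, the boosted thread looks better off elsewhere (the Xorg migration in the trace above). The CPU-selection logic here is a deliberate oversimplification of ULE's cpu_search(), written only to illustrate the effect:

```c
#include <assert.h>
#include <stdio.h>

#define NCPU 2

/* Toy per-CPU load counters standing in for tdq_load. */
static int cpu_load[NCPU];

/* Pick the least-loaded CPU, preferring lower-numbered CPUs on ties. */
static int
pick_least_loaded(void)
{
    int best = 0;

    for (int c = 1; c < NCPU; c++)
        if (cpu_load[c] < cpu_load[best])
            best = c;
    return (best);
}
```

With the lender's load still counted on CPU 1, a tie sends the boosted thread to CPU 0 (preempting whatever runs there); after a sched_load_rem()-style decrement, CPU 1 correctly wins and the boosted thread stays on the CPU that is about to go idle.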
Re: ule+smp: small optimization for turnstile priority lending
On 10/29/12, Attilio Rao atti...@freebsd.org wrote:
On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao atti...@freebsd.org wrote:
On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon a...@freebsd.org wrote:
on 20/09/2012 16:14 Attilio Rao said the following:
On 9/20/12, Andriy Gapon a...@freebsd.org wrote:
[snip]

The patch works well as far as I can tell. Thank you! There is one warning with full WITNESS enabled, but it appears to be harmless (so far).

Andriy, thanks a lot for your testing and the reports you have made so far. Unfortunately I'm going off for 2 weeks now and I won't work on FreeBSD in that timeframe. I will get back to these in 2 weeks, then. If you want to play more with this idea, feel free to extend/fix/etc. this patch.

Unfortunately I haven't found time to work on this further, but I have some additional thoughts.

First, I think that the witness message was not benign, and it actually warns about the same kind of deadlock that I originally had. The problem is that sched_lend_prio() is called with the target thread's td_lock held, which is the lock of the tdq on that thread's CPU. Then, in your patch, we acquire the current tdq's lock to modify its load. But if two tdq locks are to be held at the same time, they must be locked using tdq_lock_pair(), so that lock order is maintained. With the patch, no tdq lock order can be maintained (it can be arbitrary), and thus it is possible to run into a deadlock.

Indeed. I realized this after thinking about this problem while I was on holiday.

I see two possibilities here, but don't rule out that there can be more.

1. Do something like my patch does. That is, manipulate the current thread's tdq in advance, before any other thread/tdq locks are acquired (or td_lock is changed to point to a different lock and the current tdq is unlocked). The API can be made more generic in nature. E.g.
it can look like this:

void
sched_thread_block(struct thread *td, int inhibitor)
{
	struct tdq *tdq;

	THREAD_LOCK_ASSERT(td, MA_OWNED);
	KASSERT(td == curthread,
	    ("sched_thread_block: only curthread is supported"));
	tdq = TDQ_SELF();
	TDQ_LOCK_ASSERT(tdq, MA_OWNED);
	MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
	TD_SET_INHIB(td, inhibitor);
	tdq_load_rem(tdq, td);
	tdq_setlowpri(tdq, td);
}

2. Try to do things from sched_lend_prio() based on curthread's state. This, as it seems, would require completely lock-less manipulation of the current tdq. E.g. for tdq_load we could use atomic operations (it is already accessed locklessly, but not modified). But for tdq_lowpri a more elaborate trick would be required, like having a separate field for a temporary value.

No, this is not going to work, because tdq_lowpri and tdq_load are updated and used in the same critical path (and possibly also other tdq members, in tdq_runqueue_add() for example, but I didn't bother to also check that).

Ok, so here I wasn't completely right, because tdq_lowpri and tdq_load are not read in the same critical path (cpu_search() and sched_pickcpu() in the self case). However, I was over-estimating a factor in this patch: I don't think we need to fix the real tdq_load for self in that case before cpu_search(), because self is handled in a very special way, with its own comparison. I then think that a patch local to sched_pickcpu() on the self path should be enough.

This brings us to another, bigger issue, however: tdq_lowpri handling is not perfect in that patch. Think about the case where the thread to be switched out is, in fact, the lowest-priority one. In that case, what we should do is recalculate tdq_lowpri, which is usually a very expensive operation and requires TDQ self locking to be in place to be carried out correctly. I think that the easy fix for this is to implement a second field, called tdq_sec_lowpri, where we store the next thread to be scheduled after tdq_lowpri.
This is not really going to change anything, because we will get it for free when we already calculate tdq_lowpri, and it will let us implement definitive logic that helps in the case you describe. However, I still want to think about the base idea behind the current algorithm in the self/likely CPU pickup.

Attilio
-- Peace can only be achieved by understanding - A. Einstein
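The tdq_lock_pair() discipline mentioned above can be illustrated generically: acquire the two locks in a globally consistent order, so no two threads can deadlock by taking the pair in opposite orders. This sketch uses pthreads and address ordering as an assumed stand-in for the kernel's comparison; it is not the ULE code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Acquire two locks in a globally consistent order (here: by address),
 * the same discipline a tdq_lock_pair()-style helper enforces for two
 * runqueue locks. Whichever argument order callers use, the underlying
 * lock order is identical, so the ABBA deadlock cannot occur.
 */
static void
lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) {
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}

static void
unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (a != b)
        pthread_mutex_unlock(b);
}
```

The deadlock Andriy describes is exactly what happens when this discipline is bypassed: one path holds the target tdq lock and wants the current one, while another path holds them in the opposite order.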
Re: ule+smp: small optimization for turnstile priority lending
On 9/20/12, David Xu davi...@freebsd.org wrote:
On 2012/09/18 22:05, Andriy Gapon wrote:
[snip: KTR trace and proposed patch, quoted in full above]

I found another scenario in taskqueue: in the function taskqueue_terminate, the current thread tries to wake another thread up and then sleeps immediately, and the tq_mutex is sometimes a spinlock. So if you remove the current thread's load from the current CPU before the wakeup, the resumed thread may be put on the same CPU, so it will improve the CPU scheduling too.
Re: ule+smp: small optimization for turnstile priority lending
On 2012/11/06 19:03, Attilio Rao wrote:
On 9/20/12, David Xu davi...@freebsd.org wrote:
[snip: KTR trace, proposed patch, and the taskqueue_terminate scenario, quoted in full above]

I think that, in order to fit with the sched_add() modifications I have in mind (see other patches within this thread), wakeup() should grow a new argument, or maybe we can use a new wakeup_flags() KPI.
Re: ule+smp: small optimization for turnstile priority lending
On 2012/11/07 14:17, Jeff Roberson wrote:
On Wed, 7 Nov 2012, David Xu wrote:
On 2012/11/06 19:03, Attilio Rao wrote:
On 9/20/12, David Xu davi...@freebsd.org wrote:
[snip: KTR trace, proposed patch, taskqueue_terminate scenario, and wakeup_flags() discussion, quoted in full above]
Re: ule+smp: small optimization for turnstile priority lending
On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon a...@freebsd.org wrote:
[snip: witness report, deadlock analysis, and the two proposed approaches, quoted in full above]

Indeed. I realized this after thinking about this problem while I was on holiday.

[re: approach 2, lock-less manipulation of the current tdq] No, this is not going to work, because tdq_lowpri and tdq_load are updated and used in the same critical path (and possibly also other tdq members, in tdq_runqueue_add() for example, but I didn't bother to also check that).

Let me think some more about this and get back to you.

Attilio
-- Peace can only be achieved by understanding - A. Einstein
Re: ule+smp: small optimization for turnstile priority lending
On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao atti...@freebsd.org wrote: On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon a...@freebsd.org wrote: on 20/09/2012 16:14 Attilio Rao said the following: On 9/20/12, Andriy Gapon a...@freebsd.org wrote: [snip] The patch works well as far as I can tell. Thank you! There is one warning with full witness enables but it appears to be harmless (so far): Andriy, thanks a lot for your testing and reports you made so far. Unfortunately I'm going off for 2 weeks now and I won't work on FreeBSD for that timeframe. I will get back to those in 2 weeks then. If you want to play more with this idea feel free to extend/fix/etc. this patch. Unfortunately I haven't found time to work on this further, but I have some additional thoughts. First, I think that the witness message was not benign and it actually warns about the same kind of deadlock that I originally had. The problem is that sched_lend_prio() is called with target thread's td_lock held, which is a lock of tdq on the thread's CPU. Then, in your patch, we acquire current tdq's lock to modify its load. But if two tdq locks are to be held at the same time, then they must be locked using tdq_lock_pair, so that lock order is maintained. With the patch no tdq lock order can be maintained (can arbitrary) and thus it is possible to run into a deadlock. Indeed. I realized this after thinking about this problem while I was on holiday. I see two possibilities here, but don't rule out that there can be more. 1. Do something like my patch does. That is, manipulate current thread's tdq in advance before any other thread/tdq locks are acquired (or td_lock is changed to point to a different lock and current tdq is unlocked). The API can be made more generic in nature. E.g. 
it can look like this: void sched_thread_block(struct thread *td, int inhibitor) { struct tdq *tdq; THREAD_LOCK_ASSERT(td, MA_OWNED); KASSERT(td == curthread, (sched_thread_block: only curthread is supported)); tdq = TDQ_SELF(); TDQ_LOCK_ASSERT(tdq, MA_OWNED); MPASS(td-td_lock == TDQ_LOCKPTR(tdq)); TD_SET_INHIB(td, inhibitor); tdq_load_rem(tdq, td); tdq_setlowpri(tdq, td); } 2. Try to do things from sched_lend_prio based on curthread's state. This, as it seems, would require completely lock-less manipulations of current tdq. E.g. for tdq_load we could use atomic operations (it is already accessed locklessly, but not modified). But for tdq_lowpri a more elaborate trick would be required like having a separate field for a temporary value. No, this is not going to work because tdq_lowpri and tdq_load are updated and used in the same critical path (and possibly also other tdq* members in tdq_runqueue_add(), for example, but I didn't bother to also check that). Ok, so here I wasn't completely right because td_lowpri and tdq_load are not read in the same critical path (cpu_search() and sched_pickcpu() in the self case). However, I was over-estimating a factor in this patch: I don't think we need to fix real tdq_load for self in that case before cpu_search() because self is handled in a very special way, with its own comparison. I then think that a patch local to sched_pickcpu() on the self path should be enough. This brings us to another bigger issue, however. tdq_lowpri handling is not perfect in that patch. Think about the case where the thread to be switched out is, infact, the lowest priority one. In that case, what happens is that what we should do is recalculating tdq_lowpri which is usually a very expensive operation and requires TDQ self locking to be in place to be brought on correctly. At this point, I would suggest to use the new version of avg_ule.patch which doesn't take into account tdq_lowpri. 
This means the optimization won't work in all cases, but at least it is a good trade-off against a more hacky version. This also makes me wonder about the tdq_lowpri check in the self case, in general. Basically it forces sched_pickcpu() to select self if and only if the new thread to schedule has a higher priority than the lowest one on curcpu. Why is it like this? Is this check used to enforce some sort of fairness? It would be good if Jeff spent a word or two on this check specifically. Anyway, here is the patch that implements what I suggested; let me know your thinking about it: http://www.freebsd.org/~attilio/avg_ule2.patch Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
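The deadlock Andriy points out above is the classic two-lock ordering problem that tdq_lock_pair() avoids by always taking the locks in one global order. A minimal user-space sketch of that idea, with hypothetical names and pthread mutexes standing in for the tdq spin locks (not the kernel code itself):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/*
 * Acquire two locks in a single global order (here: by address), the
 * same trick tdq_lock_pair() uses, so two threads pairing the same
 * locks in opposite argument order cannot deadlock against each other.
 */
static void
lock_pair(pthread_mutex_t *one, pthread_mutex_t *two)
{
	if (one == two) {
		pthread_mutex_lock(one);
		return;
	}
	if ((uintptr_t)one < (uintptr_t)two) {
		pthread_mutex_lock(one);
		pthread_mutex_lock(two);
	} else {
		pthread_mutex_lock(two);
		pthread_mutex_lock(one);
	}
}

static void
unlock_pair(pthread_mutex_t *one, pthread_mutex_t *two)
{
	pthread_mutex_unlock(one);
	if (one != two)
		pthread_mutex_unlock(two);
}
```

The witness warning in the thread is exactly a violation of this discipline: sched_lend_prio() already holds the target thread's tdq lock when the current tdq's lock is wanted, so no consistent pair order can be imposed at that point.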
Re: ule+smp: small optimization for turnstile priority lending
On Mon, Oct 29, 2012 at 6:59 PM, Attilio Rao atti...@freebsd.org wrote: [ trimm ] This also makes me wonder about the tdq_lowpri check in the self case, in general. Basically it forces sched_pickcpu() to select self if and only if the new thread to schedule has a higher priority than the lowest one on curcpu. Why is it like this? Is this check used to enforce some sort of fairness? It would be good if Jeff spent a word or two on this check specifically. Also, I've read the code of tdq_setlowpri() more closely and I think the name tdq_lowpri is highly misleading, because it seems to me that it is the *highest*-priority thread that gets returned. That said, this means that self will be selected in sched_pickcpu() if and only if the new thread has a higher priority than all the ones on the self runqueue. Isn't all this a bit too extreme? Assuming that one cpu has only a single high-priority thread and the others are very loaded, it would likely make sense to not keep loading them but to switch to the self one, maybe. Anyway, here is the patch that implements what I suggested; let me know your thinking about it: http://www.freebsd.org/~attilio/avg_ule2.patch However, I was thinking that maybe we could do the tdq_choose() calculation if self == target, to have a slightly better chance of getting the optimization in place, eventually. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein
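For reference, the check being questioned reduces to a one-line predicate. A simplified sketch with a hypothetical helper name (in FreeBSD a numerically *lower* value means a *higher* priority, which is why tdq_lowpri is such a confusing name for the best priority on the queue):

```c
#include <assert.h>

/*
 * Sketch of ULE's "stay on the current CPU" test in sched_pickcpu():
 * self is preferred only when the thread being scheduled would beat
 * the best (numerically lowest) priority currently on this CPU.
 */
static int
prefer_self(int new_pri, int self_lowpri)
{
	return (new_pri < self_lowpri);
}
```

With the KTR trace earlier in the thread, a thread boosted to prio 122 stays on its CPU only if everything there is numerically above 122; that is the behaviour Attilio calls "a bit too extreme".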
Re: ule+smp: small optimization for turnstile priority lending
on 20/09/2012 16:14 Attilio Rao said the following: On 9/20/12, Andriy Gapon a...@freebsd.org wrote: [snip] The patch works well as far as I can tell. Thank you! There is one warning with full witness enabled but it appears to be harmless (so far): Andriy, thanks a lot for your testing and the reports you made so far. Unfortunately I'm going off for 2 weeks now and I won't work on FreeBSD in that timeframe. I will get back to those in 2 weeks then. If you want to play more with this idea feel free to extend/fix/etc. this patch. Unfortunately I haven't found time to work on this further, but I have some additional thoughts. First, I think that the witness message was not benign and it actually warns about the same kind of deadlock that I originally had. The problem is that sched_lend_prio() is called with the target thread's td_lock held, which is a lock of the tdq on the thread's CPU. Then, in your patch, we acquire the current tdq's lock to modify its load. But if two tdq locks are to be held at the same time, then they must be locked using tdq_lock_pair, so that lock order is maintained. With the patch no tdq lock order can be maintained (it can be arbitrary) and thus it is possible to run into a deadlock. I see two possibilities here, but don't rule out that there can be more. 1. Do something like my patch does. That is, manipulate the current thread's tdq in advance, before any other thread/tdq locks are acquired (or td_lock is changed to point to a different lock and the current tdq is unlocked). The API can be made more generic in nature. E.g. it can look like this: void sched_thread_block(struct thread *td, int inhibitor) { struct tdq *tdq; THREAD_LOCK_ASSERT(td, MA_OWNED); KASSERT(td == curthread, ("sched_thread_block: only curthread is supported")); tdq = TDQ_SELF(); TDQ_LOCK_ASSERT(tdq, MA_OWNED); MPASS(td->td_lock == TDQ_LOCKPTR(tdq)); TD_SET_INHIB(td, inhibitor); tdq_load_rem(tdq, td); tdq_setlowpri(tdq, td); } 2. Try to do things from sched_lend_prio based on curthread's state.
This, as it seems, would require completely lock-less manipulations of the current tdq. E.g. for tdq_load we could use atomic operations (it is already accessed locklessly, but not modified). But for tdq_lowpri a more elaborate trick would be required, like having a separate field for a temporary value. In any case, I'll have to revisit this later. -- Andriy Gapon
Re: ule+smp: small optimization for turnstile priority lending
On 2012/09/18 22:05, Andriy Gapon wrote: Here is a snippet that demonstrates the issue on a supposedly fully loaded 2-processor system: 136794 0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818, state:running, attributes: prio:122 136793 0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916, state:yielding, attributes: prio:183, wmesg:(null), lockname:(null) 136792 1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14, state:running, attributes: prio:255 136791 1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0, attributes: none 136790 1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473, state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx 136789 1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2, attributes: none 136788 1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473, point:wokeup, attributes: linkedto:Xorg tid 102818 136787 1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq add, attributes: prio:122, linkedto:firefox tid 113473 136786 1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1, attributes: none 136785 1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq rem, attributes: prio:176 136784 1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818, point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473 136783 1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473, state:running, attributes: prio:104 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with zero load. Here is a proposed solution: turnstile_wait: optimize priority lending to a thread on a runqueue As the current thread is definitely going into mi_switch, it now removes its load before doing priority propagation, which can potentially result in sched_add.
In the SMP ULE case the latter searches for the least loaded CPU to place a boosted thread, which is supposedly about to run. diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c index 8e466cd..3299cae 100644 --- a/sys/kern/sched_ule.c +++ b/sys/kern/sched_ule.c @@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread *newtd, int flags) /* This thread must be going to sleep. */ TDQ_LOCK(tdq); mtx = thread_lock_block(td); - tdq_load_rem(tdq, td); +#if defined(SMP) + if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE) +#endif + tdq_load_rem(tdq, td); } /* * We enter here with the thread blocked and assigned to the @@ -2412,6 +2415,21 @@ sched_rem(struct thread *td) tdq_setlowpri(tdq, NULL); } +void +sched_load_rem(struct thread *td) +{ + struct tdq *tdq; + + KASSERT(td == curthread, + ("sched_load_rem: only curthread is supported")); + KASSERT(td->td_oncpu == td->td_sched->ts_cpu, + ("thread running on cpu different from ts_cpu")); + tdq = TDQ_CPU(td->td_sched->ts_cpu); + TDQ_LOCK_ASSERT(tdq, MA_OWNED); + MPASS(td->td_lock == TDQ_LOCKPTR(tdq)); + tdq_load_rem(tdq, td); +} + /* * Fetch cpu utilization information. Updates on demand. */ diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c index 31d16fe..d1d68e9 100644 --- a/sys/kern/subr_turnstile.c +++ b/sys/kern/subr_turnstile.c @@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue) LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash); } thread_lock(td); +#if defined(SCHED_ULE) && defined(SMP) + /* + * Remove load earlier so that it does not affect cpu selection + * for a thread woken up due to priority lending, if any.
+ */ + sched_load_rem(td); +#endif thread_lock_set(td, &ts->ts_lock); td->td_turnstile = NULL; diff --git a/sys/sys/sched.h b/sys/sys/sched.h index 4b8387c..b1ead1b 100644 --- a/sys/sys/sched.h +++ b/sys/sys/sched.h @@ -110,6 +110,9 @@ void sched_preempt(struct thread *td); void sched_add(struct thread *td, int flags); void sched_clock(struct thread *td); void sched_rem(struct thread *td); +#if defined(SCHED_ULE) && defined(SMP) +void sched_load_rem(struct thread *td); +#endif void sched_tick(int cnt); void sched_relinquish(struct thread *td); struct thread *sched_choose(void); I found another scenario in the taskqueue code: in the function taskqueue_terminate, the current thread tries to wake another thread up and then sleeps immediately, and the tq_mutex is sometimes a spinlock. So if you remove the current thread's load from the current cpu before the wakeup, the resumed thread may be put on the same cpu, so it will improve the cpu scheduling there too. /* * Signal a taskqueue thread to terminate. */ static void taskqueue_terminate(struct thread **pp,
Re: ule+smp: small optimization for turnstile priority lending
on 19/09/2012 10:33 Attilio Rao said the following: It is hard for me to tell if this is subject to the same issues because I do not entirely understand where the locking was happening in your patch. Can you try testing this with your already KTR-instrumented test and possibly report? The patch works well as far as I can tell. Thank you! There is one warning with full witness enabled but it appears to be harmless (so far): Acquiring duplicate lock of same type: sched lock 1st sched lock 1 @ /usr/src/sys/kern/subr_turnstile.c:212 2nd sched lock 0 @ /usr/src/sys/kern/sched_ule.c:1244 KDB: stack backtrace: db_trace_self_wrapper() at 0x802d238a = db_trace_self_wrapper+0x2a kdb_backtrace() at 0x80555f4a = kdb_backtrace+0x3a _witness_debugger() at 0x8056e2bc = _witness_debugger+0x2c witness_checkorder() at 0x8056f759 = witness_checkorder+0x579 _mtx_lock_spin_flags() at 0x80504bcd = _mtx_lock_spin_flags+0x10d sched_pickcpu() at 0x80547829 = sched_pickcpu+0x199 sched_add() at 0x80547bdb = sched_add+0x14b sched_thread_priority() at 0x80548059 = sched_thread_priority+0x1c9 sched_lend_prio() at 0x80548344 = sched_lend_prio+0x14 propagate_priority() at 0x8056801e = propagate_priority+0x1ce turnstile_wait() at 0x8056959f = turnstile_wait+0x4ef _mtx_lock_sleep() at 0x805045f6 = _mtx_lock_sleep+0x486 _mtx_lock_flags() at 0x80504814 = _mtx_lock_flags+0x104 lock_mtx() at 0x805049ca = lock_mtx+0x1a _sleep() at 0x80524589 = _sleep+0x4f9 taskqueue_thread_loop() at 0x805664c8 = taskqueue_thread_loop+0xa8 fork_exit() at 0x804e5d6a = fork_exit+0x1aa fork_trampoline() at 0x806ea2ce = fork_trampoline+0xe -- Andriy Gapon
Re: ule+smp: small optimization for turnstile priority lending
on 20/09/2012 14:04 Andriy Gapon said the following: on 19/09/2012 10:33 Attilio Rao said the following: It is hard for me to tell if this is subject to the same issues because I do not entirely understand where the locking was happening in your patch. Can you try testing this with your already KTR-instrumented test and possibly report? The patch works well as far as I can tell. Thank you! After more testing it seems that the idea was not complete. While the loads are now set better, tdq_lowpri (which is generally <= newpri) on the current CPU still prevents the boosted thread from taking advantage of the lowered load. Not sure how to refresh tdq_lowpri in this scenario... There is one warning with full witness enabled but it appears to be harmless (so far): Acquiring duplicate lock of same type: sched lock 1st sched lock 1 @ /usr/src/sys/kern/subr_turnstile.c:212 2nd sched lock 0 @ /usr/src/sys/kern/sched_ule.c:1244 KDB: stack backtrace: db_trace_self_wrapper() at 0x802d238a = db_trace_self_wrapper+0x2a kdb_backtrace() at 0x80555f4a = kdb_backtrace+0x3a _witness_debugger() at 0x8056e2bc = _witness_debugger+0x2c witness_checkorder() at 0x8056f759 = witness_checkorder+0x579 _mtx_lock_spin_flags() at 0x80504bcd = _mtx_lock_spin_flags+0x10d sched_pickcpu() at 0x80547829 = sched_pickcpu+0x199 sched_add() at 0x80547bdb = sched_add+0x14b sched_thread_priority() at 0x80548059 = sched_thread_priority+0x1c9 sched_lend_prio() at 0x80548344 = sched_lend_prio+0x14 propagate_priority() at 0x8056801e = propagate_priority+0x1ce turnstile_wait() at 0x8056959f = turnstile_wait+0x4ef _mtx_lock_sleep() at 0x805045f6 = _mtx_lock_sleep+0x486 _mtx_lock_flags() at 0x80504814 = _mtx_lock_flags+0x104 lock_mtx() at 0x805049ca = lock_mtx+0x1a _sleep() at 0x80524589 = _sleep+0x4f9 taskqueue_thread_loop() at 0x805664c8 = taskqueue_thread_loop+0xa8 fork_exit() at 0x804e5d6a = fork_exit+0x1aa fork_trampoline() at 0x806ea2ce = fork_trampoline+0xe -- Andriy Gapon
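Refreshing tdq_lowpri after the load removal is the hard part Andriy mentions: once the cached value goes stale, finding the new best priority means scanning the queue, as tdq_setlowpri() does under the tdq lock. A toy illustration of why this is not cheap or lock-free (hypothetical helper, priorities flattened into an array; 255 plays the role of the idle priority):

```c
#include <assert.h>

/*
 * Recomputing the best (numerically lowest) priority on a queue is a
 * full scan over its threads - which is why tdq_lowpri cannot simply
 * be refreshed from sched_lend_prio() without holding the tdq lock.
 */
static int
recompute_lowpri(const int *prios, int n)
{
	int best = 255;		/* no runnable threads: idle priority */

	for (int i = 0; i < n; i++)
		if (prios[i] < best)
			best = prios[i];
	return (best);
}
```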
Re: ule+smp: small optimization for turnstile priority lending
On 9/20/12, Andriy Gapon a...@freebsd.org wrote: on 19/09/2012 10:33 Attilio Rao said the following: It is hard for me to tell if this is subject to the same issues because I do not entirely understand where the locking was happening in your patch. Can you try testing this with your already KTR-instrumented test and possibly report? The patch works well as far as I can tell. Thank you! There is one warning with full witness enabled but it appears to be harmless (so far): Andriy, thanks a lot for your testing and the reports you made so far. Unfortunately I'm going off for 2 weeks now and I won't work on FreeBSD in that timeframe. I will get back to those in 2 weeks then. If you want to play more with this idea feel free to extend/fix/etc. this patch. Attilio -- Peace can only be achieved by understanding - A. Einstein
Re: ule+smp: small optimization for turnstile priority lending
on 19/09/2012 02:01 Attilio Rao said the following: On Tue, Sep 18, 2012 at 8:07 PM, Andriy Gapon a...@freebsd.org wrote: on 18/09/2012 19:50 Attilio Rao said the following: On 9/18/12, Andriy Gapon a...@freebsd.org wrote: Here is a snippet that demonstrates the issue on a supposedly fully loaded 2-processor system: 136794 0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818, state:running, attributes: prio:122 136793 0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916, state:yielding, attributes: prio:183, wmesg:(null), lockname:(null) 136792 1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14, state:running, attributes: prio:255 136791 1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0, attributes: none 136790 1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473, state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx 136789 1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2, attributes: none 136788 1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473, point:wokeup, attributes: linkedto:Xorg tid 102818 136787 1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq add, attributes: prio:122, linkedto:firefox tid 113473 136786 1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1, attributes: none 136785 1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq rem, attributes: prio:176 136784 1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818, point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473 136783 1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473, state:running, attributes: prio:104 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with zero load.
I think that the idea is bright, but I have reservations against the implementation because it seems to me there are too many layering violations. Just one - for a layer between turnstile and scheduler :-) But I agree. What I suggest is somewhat summarized like this: - Add a new SRQ_WILLSLEEP or the name you prefer - Add a new flags argument to sched_lend_prio() (both ule and 4bsd) and sched_thread_priority (ule only) - sched_thread_priority() will pass down the new flag to sched_add(), which passes it down to sched_pickcpu(). This way sched_pickcpu() has the correct knowledge of what is going on and it can make the right decision. You likely don't need to lower the tdq_load at that time either this way, because sched_pickcpu() can just adjust it locally for its decision. What do you think? This sounds easy but it is not quite so, given the implementation of sched_pickcpu and sched_lowest. This is probably more work than I am able to take on now. I think actually that the attached patch should do what you need. Of course there is more runqueue locking, due to the tdq_load fondling, but I'm not sure it is really a big deal. I didn't test it yet; as I understand you have a test case, so maybe you can. However if Jeff agrees I can send the patch to flo@ for performance evaluation. Thoughts? Originally I had a similar patch with a small difference: I did the tdq_load manipulations in sched_thread_priority around the sched_add call. That patch produced a deadlock because of the need for TDQ_LOCK.
Index: sys/sys/sched.h === --- sys/sys/sched.h (revision 240672) +++ sys/sys/sched.h (working copy) @@ -91,7 +91,7 @@ void sched_nice(struct proc *p, int nice); */ void sched_exit_thread(struct thread *td, struct thread *child); void sched_fork_thread(struct thread *td, struct thread *child); -void sched_lend_prio(struct thread *td, u_char prio); +void sched_lend_prio(struct thread *td, u_char prio, int flags); void sched_lend_user_prio(struct thread *td, u_char pri); fixpt_t sched_pctcpu(struct thread *td); void sched_prio(struct thread *td, u_char prio); @@ -161,6 +161,7 @@ sched_unpin(void) #define SRQ_INTR 0x0004 /* It is probably urgent. */ #define SRQ_PREEMPTED 0x0008 /* has been preempted.. be kind */ #define SRQ_BORROWING 0x0010 /* Priority updated due to prio_lend */ +#define SRQ_WILLSWITCH 0x0020 /* curthread will be switched out */ /* Scheduler stats. */ #ifdef SCHED_STATS Index: sys/kern/sched_ule.c === --- sys/kern/sched_ule.c (revision 240672) +++ sys/kern/sched_ule.c (working copy) @@ -290,7 +290,7 @@ static struct tdq tdq_cpu; #define TDQ_LOCKPTR(t) (&(t)->tdq_lock) static void sched_priority(struct thread *); -static void sched_thread_priority(struct thread *,
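The decision-local adjustment Attilio proposes, letting the CPU picker pretend the caller's CPU has one unit of load less instead of writing to tdq_load, can be sketched in user space like this (hypothetical flattened model; the real sched_pickcpu() also weighs topology, affinity and idle state):

```c
#include <assert.h>
#include <limits.h>

#define SRQ_WILLSWITCH	0x0020	/* curthread will be switched out */

struct cpu {
	int load;		/* stand-in for tdq_load */
};

/*
 * Pick the least-loaded CPU.  With SRQ_WILLSWITCH the caller's CPU is
 * viewed with one unit of load less - its current thread is about to
 * block - without ever modifying the shared counter, so no extra tdq
 * locking is needed for the decision.
 */
static int
pick_cpu(const struct cpu *cpus, int ncpu, int curcpu, int flags)
{
	int best = -1, bestload = INT_MAX;

	for (int i = 0; i < ncpu; i++) {
		int load = cpus[i].load;

		if (i == curcpu && (flags & SRQ_WILLSWITCH))
			load--;
		if (load < bestload) {
			bestload = load;
			best = i;
		}
	}
	return (best);
}
```

In the Xorg trace from the start of the thread, this is exactly the difference between bouncing the boosted thread to the other CPU and letting it stay where its lender is about to block.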
Re: ule+smp: small optimization for turnstile priority lending
On Wed, Sep 19, 2012 at 7:03 AM, Andriy Gapon a...@freebsd.org wrote: on 19/09/2012 02:01 Attilio Rao said the following: On Tue, Sep 18, 2012 at 8:07 PM, Andriy Gapon a...@freebsd.org wrote: on 18/09/2012 19:50 Attilio Rao said the following: On 9/18/12, Andriy Gapon a...@freebsd.org wrote: Here is a snippet that demonstrates the issue on a supposedly fully loaded 2-processor system: 136794 0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818, state:running, attributes: prio:122 136793 0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916, state:yielding, attributes: prio:183, wmesg:(null), lockname:(null) 136792 1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14, state:running, attributes: prio:255 136791 1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0, attributes: none 136790 1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473, state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx 136789 1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2, attributes: none 136788 1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473, point:wokeup, attributes: linkedto:Xorg tid 102818 136787 1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq add, attributes: prio:122, linkedto:firefox tid 113473 136786 1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1, attributes: none 136785 1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq rem, attributes: prio:176 136784 1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818, point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473 136783 1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473, state:running, attributes: prio:104 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with zero load.
I think that the idea is bright, but I have reservations against the implementation because it seems to me there are too many layering violations. Just one - for a layer between turnstile and scheduler :-) But I agree. What I suggest is somewhat summarized like this: - Add a new SRQ_WILLSLEEP or the name you prefer - Add a new flags argument to sched_lend_prio() (both ule and 4bsd) and sched_thread_priority (ule only) - sched_thread_priority() will pass down the new flag to sched_add(), which passes it down to sched_pickcpu(). This way sched_pickcpu() has the correct knowledge of what is going on and it can make the right decision. You likely don't need to lower the tdq_load at that time either this way, because sched_pickcpu() can just adjust it locally for its decision. What do you think? This sounds easy but it is not quite so, given the implementation of sched_pickcpu and sched_lowest. This is probably more work than I am able to take on now. I think actually that the attached patch should do what you need. Of course there is more runqueue locking, due to the tdq_load fondling, but I'm not sure it is really a big deal. I didn't test it yet; as I understand you have a test case, so maybe you can. However if Jeff agrees I can send the patch to flo@ for performance evaluation. Thoughts? Originally I had a similar patch with a small difference: I did the tdq_load manipulations in sched_thread_priority around the sched_add call. That patch produced a deadlock because of the need for TDQ_LOCK. It is hard for me to tell if this is subject to the same issues because I do not entirely understand where the locking was happening in your patch. Can you try testing this with your already KTR-instrumented test and possibly report? Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein
ule+smp: small optimization for turnstile priority lending
Here is a snippet that demonstrates the issue on a supposedly fully loaded 2-processor system: 136794 0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818, state:running, attributes: prio:122 136793 0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916, state:yielding, attributes: prio:183, wmesg:(null), lockname:(null) 136792 1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14, state:running, attributes: prio:255 136791 1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0, attributes: none 136790 1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473, state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx 136789 1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2, attributes: none 136788 1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473, point:wokeup, attributes: linkedto:Xorg tid 102818 136787 1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq add, attributes: prio:122, linkedto:firefox tid 113473 136786 1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1, attributes: none 136785 1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq rem, attributes: prio:176 136784 1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818, point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473 136783 1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473, state:running, attributes: prio:104 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with zero load. Here is a proposed solution: turnstile_wait: optimize priority lending to a thread on a runqueue As the current thread is definitely going into mi_switch, it now removes its load before doing priority propagation, which can potentially result in sched_add.
In the SMP ULE case the latter searches for the least loaded CPU to place a boosted thread, which is supposedly about to run. diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c index 8e466cd..3299cae 100644 --- a/sys/kern/sched_ule.c +++ b/sys/kern/sched_ule.c @@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread *newtd, int flags) /* This thread must be going to sleep. */ TDQ_LOCK(tdq); mtx = thread_lock_block(td); - tdq_load_rem(tdq, td); +#if defined(SMP) + if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE) +#endif + tdq_load_rem(tdq, td); } /* * We enter here with the thread blocked and assigned to the @@ -2412,6 +2415,21 @@ sched_rem(struct thread *td) tdq_setlowpri(tdq, NULL); } +void +sched_load_rem(struct thread *td) +{ + struct tdq *tdq; + + KASSERT(td == curthread, + ("sched_load_rem: only curthread is supported")); + KASSERT(td->td_oncpu == td->td_sched->ts_cpu, + ("thread running on cpu different from ts_cpu")); + tdq = TDQ_CPU(td->td_sched->ts_cpu); + TDQ_LOCK_ASSERT(tdq, MA_OWNED); + MPASS(td->td_lock == TDQ_LOCKPTR(tdq)); + tdq_load_rem(tdq, td); +} + /* * Fetch cpu utilization information. Updates on demand. */ diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c index 31d16fe..d1d68e9 100644 --- a/sys/kern/subr_turnstile.c +++ b/sys/kern/subr_turnstile.c @@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue) LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash); } thread_lock(td); +#if defined(SCHED_ULE) && defined(SMP) + /* + * Remove load earlier so that it does not affect cpu selection + * for a thread woken up due to priority lending, if any.
+ */ + sched_load_rem(td); +#endif thread_lock_set(td, &ts->ts_lock); td->td_turnstile = NULL; diff --git a/sys/sys/sched.h b/sys/sys/sched.h index 4b8387c..b1ead1b 100644 --- a/sys/sys/sched.h +++ b/sys/sys/sched.h @@ -110,6 +110,9 @@ void sched_preempt(struct thread *td); void sched_add(struct thread *td, int flags); void sched_clock(struct thread *td); void sched_rem(struct thread *td); +#if defined(SCHED_ULE) && defined(SMP) +void sched_load_rem(struct thread *td); +#endif void sched_tick(int cnt); void sched_relinquish(struct thread *td); struct thread *sched_choose(void); -- Andriy Gapon
Re: ule+smp: small optimization for turnstile priority lending
On 9/18/12, Andriy Gapon a...@freebsd.org wrote: Here is a snippet that demonstrates the issue on a supposedly fully loaded 2-processor system: 136794 0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818, state:running, attributes: prio:122 136793 0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916, state:yielding, attributes: prio:183, wmesg:(null), lockname:(null) 136792 1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14, state:running, attributes: prio:255 136791 1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0, attributes: none 136790 1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473, state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx 136789 1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2, attributes: none 136788 1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473, point:wokeup, attributes: linkedto:Xorg tid 102818 136787 1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq add, attributes: prio:122, linkedto:firefox tid 113473 136786 1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1, attributes: none 136785 1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818, state:runq rem, attributes: prio:176 136784 1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818, point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473 136783 1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473, state:running, attributes: prio:104 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with zero load. I think that the idea is bright, but I have reservations against the implementation because it seems to me there are too many layering violations.
What I suggest is somewhat summarized like this: - Add a new SRQ_WILLSLEEP or the name you prefer - Add a new flags argument to sched_lend_prio() (both ule and 4bsd) and sched_thread_priority (ule only) - sched_thread_priority() will pass down the new flag to sched_add(), which passes it down to sched_pickcpu(). This way sched_pickcpu() has the correct knowledge of what is going on and it can make the right decision. You likely don't need to lower the tdq_load at that time either this way, because sched_pickcpu() can just adjust it locally for its decision. What do you think? Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein
Re: ule+smp: small optimization for turnstile priority lending
on 18/09/2012 19:50 Attilio Rao said the following:
> On 9/18/12, Andriy Gapon a...@freebsd.org wrote:
>> [KTR snippet demonstrating the issue snipped; see above]
>
> I think that the idea is bright, but I have reservations about the
> implementation because it seems to me there are too many layering
> violations.

Just one - for a layer between turnstile and scheduler :-) But I agree.
> [proposal snipped: add SRQ_WILLSLEEP and thread a flags argument from
> sched_lend_prio() down to sched_pickcpu()]

This sounds easy but it is not quite so, given the implementation of
sched_pickcpu() and sched_lowest(). This is probably more work than I am
able to take on now.

--
Andriy Gapon
Re: ule+smp: small optimization for turnstile priority lending
On Tue, 18 Sep 2012, Andriy Gapon wrote:
> on 18/09/2012 19:50 Attilio Rao said the following:
>> [KTR snippet demonstrating the issue snipped; see above]
>>
>> I think that the idea is bright, but I have reservations about the
>> implementation because it seems to me there are too many layering
>> violations.
>
> Just one - for a layer between turnstile and scheduler :-) But I agree.
> [proposal snipped: add SRQ_WILLSLEEP and thread a flags argument from
> sched_lend_prio() down to sched_pickcpu()]
>
> This sounds easy but it is not quite so, given the implementation of
> sched_pickcpu() and sched_lowest(). This is probably more work than I am
> able to take on now.

I agree with Attilio's assessment. I have tried to do similar things before
for ithreads etc. I think a more generic approach would be good. I will put
it on my list of things to look at and we'll see who gets time first.

Thanks,
Jeff
Re: ule+smp: small optimization for turnstile priority lending
On Tue, Sep 18, 2012 at 8:07 PM, Andriy Gapon a...@freebsd.org wrote:
> on 18/09/2012 19:50 Attilio Rao said the following:
>> On 9/18/12, Andriy Gapon a...@freebsd.org wrote:
>>> [KTR snippet demonstrating the issue snipped; see above]
>>
>> I think that the idea is bright, but I have reservations about the
>> implementation because it seems to me there are too many layering
>> violations.
> Just one - for a layer between turnstile and scheduler :-) But I agree.
>
> [proposal snipped: add SRQ_WILLSLEEP and thread a flags argument from
> sched_lend_prio() down to sched_pickcpu()]
>
> This sounds easy but it is not quite so, given the implementation of
> sched_pickcpu() and sched_lowest(). This is probably more work than I am
> able to take on now.

I think actually that the attached patch should do what you need. Of course
there is more runqueue locking, due to the tdq_load fondling, but I'm not
sure it is really a big deal. I didn't test it yet; as I understand you have
a test case, so maybe you can. However, if Jeff agrees, I can send the patch
to flo@ for performance evaluation.

Thoughts?
Attilio

Index: sys/sys/sched.h
===================================================================
--- sys/sys/sched.h	(revision 240672)
+++ sys/sys/sched.h	(working copy)
@@ -91,7 +91,7 @@ void	sched_nice(struct proc *p, int nice);
  */
 void	sched_exit_thread(struct thread *td, struct thread *child);
 void	sched_fork_thread(struct thread *td, struct thread *child);
-void	sched_lend_prio(struct thread *td, u_char prio);
+void	sched_lend_prio(struct thread *td, u_char prio, int flags);
 void	sched_lend_user_prio(struct thread *td, u_char pri);
 fixpt_t	sched_pctcpu(struct thread *td);
 void	sched_prio(struct thread *td, u_char prio);
@@ -161,6 +161,7 @@ sched_unpin(void)
 #define	SRQ_INTR	0x0004		/* It is probably urgent. */
 #define	SRQ_PREEMPTED	0x0008		/* has been preempted.. be kind */
 #define	SRQ_BORROWING	0x0010		/* Priority updated due to prio_lend */
+#define	SRQ_WILLSWITCH	0x0020		/* curthread will be switched out */

 /* Scheduler stats. */
 #ifdef SCHED_STATS
Index: sys/kern/sched_ule.c
===================================================================
--- sys/kern/sched_ule.c	(revision 240672)
+++ sys/kern/sched_ule.c	(working copy)
@@ -290,7 +290,7 @@ static struct tdq	tdq_cpu;
 #define	TDQ_LOCKPTR(t)		(&(t)->tdq_lock)

 static void sched_priority(struct thread *);
-static void sched_thread_priority(struct thread *, u_char);
+static void sched_thread_priority(struct thread *, u_char, int flags);
 static int sched_interact_score(struct thread *);
 static void sched_interact_update(struct thread *);
 static void sched_interact_fork(struct thread *);
@@ -1233,6 +1233,18 @@ sched_pickcpu(struct thread *td,
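The "more runqueue locking, due to the tdq_load fondling" that Attilio
mentions can be sketched in userland as follows. This is a hypothetical
model, not the actual patch: struct tdq, pick_lowest(), and
pickcpu_willswitch() are illustrative names, and the idea shown is simply to
hide the soon-to-sleep current thread's load under its run-queue lock so an
unmodified least-loaded search does not count it:

```c
#include <pthread.h>

struct tdq {
	pthread_mutex_t	tdq_lock;
	int		tdq_load;
};

/* Unmodified least-loaded search, standing in for sched_lowest(). */
static int
pick_lowest(struct tdq *tdqs, int ncpu)
{
	int best = 0;

	for (int c = 1; c < ncpu; c++)
		if (tdqs[c].tdq_load < tdqs[best].tdq_load)
			best = c;
	return (best);
}

/* CPU choice with the current CPU's load temporarily hidden. */
static int
pickcpu_willswitch(struct tdq *tdqs, int ncpu, int curcpu)
{
	int cpu;

	/* Extra locking: drop our load so the search ignores curthread. */
	pthread_mutex_lock(&tdqs[curcpu].tdq_lock);
	tdqs[curcpu].tdq_load--;
	pthread_mutex_unlock(&tdqs[curcpu].tdq_lock);

	cpu = pick_lowest(tdqs, ncpu);

	/* Restore the count; the real switch-out happens later. */
	pthread_mutex_lock(&tdqs[curcpu].tdq_lock);
	tdqs[curcpu].tdq_load++;
	pthread_mutex_unlock(&tdqs[curcpu].tdq_lock);

	return (cpu);
}
```

With equal loads on both CPUs, hiding the current thread makes its own CPU
look lighter, so the woken thread is placed there instead of migrating — at
the cost of two extra lock round-trips, which is the trade-off being debated.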