Re: ule+smp: small optimization for turnstile priority lending

2012-11-09 Thread Andre Oppermann

On 07.11.2012 08:45, David Xu wrote:

On 2012/11/07 14:17, Jeff Roberson wrote:

On Wed, 7 Nov 2012, David Xu wrote:


On 2012/11/06 19:03, Attilio Rao wrote:

On 9/20/12, David Xu davi...@freebsd.org wrote:

I found another scenario in taskqueue, in the function
taskqueue_terminate, current thread tries to wake
another thread up and sleep immediately, the tq_mutex sometimes
is a spinlock. So if you remove one thread load from current cpu
before wakeup, the resumed thread may be put on same cpu,
so it will optimize the cpu scheduling too.
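
For reference, the function being discussed looks roughly like this (a paraphrase of
sys/kern/subr_taskqueue.c from that era, not an exact quote; the loop condition is
simplified). The point is that the wakeup and the sleep happen back to back under
tq_mutex:

/*
 * Paraphrased sketch of taskqueue_terminate(): the caller wakes the
 * taskqueue threads and immediately sleeps on the same (possibly spin)
 * tq_mutex, so removing its load before the wakeup would let the
 * scheduler place the woken thread on the soon-to-be-idle cpu.
 */
static void
taskqueue_terminate(struct thread **pp, struct taskqueue *tq)
{

	while (tq->tq_tcount > 0) {	/* condition simplified here */
		wakeup(tq);		/* wake the taskqueue threads... */
		TQ_SLEEP(tq, pp, &tq->tq_mutex, PWAIT,
		    "taskqueue_destroy", 0);	/* ...and sleep at once */
	}
}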


I think that in order to fit with the sched_add() modifications I have in mind
(see other patches within this thread) wakeup() should grow a new
argument, or maybe we can introduce a new wakeup_flags() KPI.
If the latter is the case, I would also propose to let wakeup_one()
be absorbed into wakeup_flags() with its own flag.
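
A minimal sketch of what such a wakeup_flags() KPI could look like (the names, flag
values and wrapper macros below are illustrative assumptions only, not an existing
interface):

/* Hypothetical flag bits for an illustrative wakeup_flags() KPI. */
#define	WAKEUP_ONE		0x01	/* wake only a single waiter */
#define	WAKEUP_WILLSLEEP	0x02	/* caller blocks right after the wakeup */

void	wakeup_flags(void *chan, int flags);

/* The existing KPIs could then become thin wrappers: */
#define	wakeup(chan)		wakeup_flags((chan), 0)
#define	wakeup_one(chan)	wakeup_flags((chan), WAKEUP_ONE)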



Yes, I like the idea.



From ~2007


http://people.freebsd.org/~jeff/wakeupflags.diff

This has some different optimizations that can be done when you have a
hinted wakeup.



wakeup_flags is a good starting point.
But what flags should be supported? I think WAKEUP_WILLSLEEP may be
needed if the current thread will switch out as soon as possible.


WAKEUP_ONE is important for accept sockets.  We don't want all idle
threads to compete and contend on a new connection.

--
Andre


I see you have added WAKEUP_LOCAL and WAKEUP_TAIL. Are they for cache
optimization? Both of them are debatable.

WAKEUP_LOCAL increases cpu migration, which is not good: one benefit of
ULE is keeping a thread on its old cpu, and WAKEUP_LOCAL violates that
design.
WAKEUP_LOCAL as used in the pipe code may not be correct if the pipe only
has a few bytes to be transferred or the receiver only eats a few bytes
each time, so why bother to move other threads to the local cpu?
If the other threads have many bytes cached on their old cpu, this
migration is expensive.

WAKEUP_TAIL is a good idea; I guess that you want to wake up a hot thread
whose code and data are still in its old cpu's cache. But this will lead
to unfairness. I'd like the sleep queue to implement a policy like I did
in libthr: it picks a thread at the tail of the queue a fixed percentage
of the time. That does introduce some unfairness, but it is not fatal at all.
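
A sketch of the kind of policy meant here (purely illustrative; the helper name, the
queue type and the percentage knob are assumptions, not code from libthr or the
sleepqueue code):

/*
 * Illustrative only: wake the thread at the tail of the sleep queue (the
 * most recently blocked, likely cache-hot one) a fixed percentage of the
 * time, otherwise wake the head (the oldest waiter) to bound unfairness.
 */
#define	TAIL_WAKEUP_PCT	25	/* hypothetical tuning knob */

static struct thread *
sleepq_pick_waiter(struct threadqueue *queue)
{
	static unsigned int counter;

	if (++counter % 100 < TAIL_WAKEUP_PCT)
		return (TAILQ_LAST(queue, threadqueue));	/* cache-hot */
	return (TAILQ_FIRST(queue));				/* fair */
}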


Thanks,
Jeff








Re: ule+smp: small optimization for turnstile priority lending

2012-11-07 Thread Jeff Roberson

On Wed, 7 Nov 2012, David Xu wrote:


On 2012/11/06 19:03, Attilio Rao wrote:

On 9/20/12, David Xu davi...@freebsd.org wrote:

On 2012/09/18 22:05, Andriy Gapon wrote:


Here is a snippet that demonstrates the issue on a supposedly fully
loaded
2-processor system:

136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid
102818,
state:running, attributes: prio:122

136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
111916,
state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
14,
state:running, attributes: prio:255

136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
counter:0,
attributes: none

136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
113473,
state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
counter:2,
attributes: none

136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
113473,
point:wokeup, attributes: linkedto:Xorg tid 102818

136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid
102818,
state:runq add, attributes: prio:122, linkedto:firefox tid 113473

136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
counter:1,
attributes: none

136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid
102818,
state:runq rem, attributes: prio:176

136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid
102818,
point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
113473

136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
113473,
state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0 where it
preempted
cc1plus thread (I do have preemption enabled) only to leave CPU 1 with
zero load.

Here is a proposed solution:

  turnstile_wait: optimize priority lending to a thread on a runqueue

  As the current thread is definitely going into mi_switch, it now
removes
  its load before doing priority propagation which can potentially
result
  in sched_add.  In the SMP & ULE case the latter searches for the
  least loaded CPU to place a boosted thread, which is supposedly
about
  to run.

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 8e466cd..3299cae 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread
*newtd, int
flags)
/* This thread must be going to sleep. */
TDQ_LOCK(tdq);
mtx = thread_lock_block(td);
-   tdq_load_rem(tdq, td);
+#if defined(SMP)
+   if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
+#endif
+   tdq_load_rem(tdq, td);
}
/*
 * We enter here with the thread blocked and assigned to the
@@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
tdq_setlowpri(tdq, NULL);
   }

+void
+sched_load_rem(struct thread *td)
+{
+   struct tdq *tdq;
+
+   KASSERT(td == curthread,
+   ("sched_rem_load: only curthread is supported"));
+   KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
+   ("thread running on cpu different from ts_cpu"));
+   tdq = TDQ_CPU(td->td_sched->ts_cpu);
+   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+   MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
+   tdq_load_rem(tdq, td);
+}
+
   /*
* Fetch cpu utilization information.  Updates on demand.
*/
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 31d16fe..d1d68e9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread
*owner,
int queue)
LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
}
thread_lock(td);
+#if defined(SCHED_ULE) && defined(SMP)
+   /*
+* Remove load earlier so that it does not affect cpu selection
+* for a thread waken up due to priority lending, if any.
+*/
+   sched_load_rem(td);
+#endif
thread_lock_set(td, &ts->ts_lock);
td->td_turnstile = NULL;

diff --git a/sys/sys/sched.h b/sys/sys/sched.h
index 4b8387c..b1ead1b 100644
--- a/sys/sys/sched.h
+++ b/sys/sys/sched.h
@@ -110,6 +110,9 @@ void	sched_preempt(struct thread *td);
   void sched_add(struct thread *td, int flags);
   void sched_clock(struct thread *td);
   void sched_rem(struct thread *td);
+#if defined(SCHED_ULE) && defined(SMP)
+void   sched_load_rem(struct thread *td);
+#endif
   void sched_tick(int cnt);
   void sched_relinquish(struct thread *td);
   struct thread *sched_choose(void);



I found another scenario in taskqueue, in the function
taskqueue_terminate, current thread tries to wake
another thread up and sleep immediately, the tq_mutex sometimes
is a spinlock. So if you remove one thread load from current cpu
before wakeup, the resumed thread may be put on same cpu,
so it 

Re: ule+smp: small optimization for turnstile priority lending

2012-11-06 Thread Attilio Rao
On 10/29/12, Attilio Rao atti...@freebsd.org wrote:
 On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao atti...@freebsd.org wrote:
 On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon a...@freebsd.org wrote:
 on 20/09/2012 16:14 Attilio Rao said the following:
 On 9/20/12, Andriy Gapon a...@freebsd.org wrote:
 [snip]
 The patch works well as far as I can tell.  Thank you!
 There is one warning with full witness enabled but it appears to be
 harmless
 (so
 far):

 Andriy,
 thanks a lot for your testing and reports you made so far.
 Unfortunately I'm going off for 2 weeks now and I won't work on
 FreeBSD for that timeframe. I will get back to those in 2 weeks then.
 If you want  to play more with this idea feel free to extend/fix/etc.
 this patch.

 Unfortunately I haven't found time to work on this further, but I have
 some
 additional thoughts.

 First, I think that the witness message was not benign and it actually
 warns about
 the same kind of deadlock that I originally had.
 The problem is that sched_lend_prio() is called with target thread's
 td_lock held,
 which is a lock of tdq on the thread's CPU.  Then, in your patch, we
 acquire
 current tdq's lock to modify its load.  But if two tdq locks are to be
 held at the
 same time, then they must be locked using tdq_lock_pair, so that lock
 order is
 maintained.  With the patch no tdq lock order can be maintained (it can be
 arbitrary)
 and thus it is possible to run into a deadlock.

 Indeed. I realized this after thinking about this problem while I was
 on holiday.


 I see two possibilities here, but don't rule out that there can be more.

 1. Do something like my patch does.  That is, manipulate current thread's
 tdq in
 advance before any other thread/tdq locks are acquired (or td_lock is
 changed to
 point to a different lock and current tdq is unlocked).  The API can be
 made more
 generic in nature.  E.g. it can look like this:
 void
 sched_thread_block(struct thread *td, int inhibitor)
 {
 struct tdq *tdq;

 THREAD_LOCK_ASSERT(td, MA_OWNED);
 KASSERT(td == curthread,
 ("sched_thread_block: only curthread is supported"));
 tdq = TDQ_SELF();
 TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 TD_SET_INHIB(td, inhibitor);
 tdq_load_rem(tdq, td);
 tdq_setlowpri(tdq, td);
 }


 2. Try to do things from sched_lend_prio based on curthread's state.
 This, as it
 seems, would require completely lock-less manipulations of current tdq.
 E.g. for
 tdq_load we could use atomic operations (it is already accessed
 locklessly, but
 not modified).  But for tdq_lowpri a more elaborate trick would be
 required like
 having a separate field for a temporary value.

 No, this is not going to work because tdq_lowpri and tdq_load are
 updated and used in the same critical path (and possibly also other
 tdq* members in tdq_runqueue_add(), for example, but I didn't bother
 to also check that).

 Ok, so here I wasn't completely right, because tdq_lowpri and tdq_load
 are not read in the same critical path (cpu_search() and
 sched_pickcpu() in the self case).
 However, I was over-estimating a factor in this patch: I don't think
 we need to fix the real tdq_load for self in that case before cpu_search(),
 because self is handled in a very special way, with its own
 comparison. I then think that a patch local to sched_pickcpu() on the
 self path should be enough.

 This brings us to another, bigger issue, however: tdq_lowpri handling
 is not perfect in that patch. Think about the case where the thread to
 be switched out is, in fact, the lowest-priority one. In that case,
 what we should do is recalculate tdq_lowpri,
 which is usually a very expensive operation and requires TDQ self
 locking to be in place to be done correctly.

I think that the easy fix for this is to implement a second field,
called tdq_sec_lowpri, where we store the priority of the next thread to
be scheduled after tdq_lowpri. This is not really going to change anything
because we get it for free when we already calculate tdq_lowpri, and it
will let us implement a definitive logic that helps in the case
you describe.

However I still want to think about the base idea behind the current
algorithm in the self/likely cpu pickup.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: ule+smp: small optimization for turnstile priority lending

2012-11-06 Thread Attilio Rao
On 9/20/12, David Xu davi...@freebsd.org wrote:
 On 2012/09/18 22:05, Andriy Gapon wrote:

 Here is a snippet that demonstrates the issue on a supposedly fully
 loaded
 2-processor system:

 136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid
 102818,
 state:running, attributes: prio:122

 136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
 111916,
 state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

 136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
 14,
 state:running, attributes: prio:255

 136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
 counter:0,
 attributes: none

 136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

 136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
 counter:2,
 attributes: none

 136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
 113473,
 point:wokeup, attributes: linkedto:Xorg tid 102818

 136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid
 102818,
 state:runq add, attributes: prio:122, linkedto:firefox tid 113473

 136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
 counter:1,
 attributes: none

 136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid
 102818,
 state:runq rem, attributes: prio:176

 136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid
 102818,
 point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
 113473

 136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:running, attributes: prio:104

 See how the Xorg thread was forced from CPU 1 to CPU 0 where it
 preempted
 cc1plus thread (I do have preemption enabled) only to leave CPU 1 with
 zero load.

 Here is a proposed solution:

  turnstile_wait: optimize priority lending to a thread on a runqueue

  As the current thread is definitely going into mi_switch, it now
 removes
  its load before doing priority propagation which can potentially
 result
  in sched_add.  In the SMP & ULE case the latter searches for the
  least loaded CPU to place a boosted thread, which is supposedly
 about
  to run.

 diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
 index 8e466cd..3299cae 100644
 --- a/sys/kern/sched_ule.c
 +++ b/sys/kern/sched_ule.c
 @@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread
 *newtd, int
 flags)
  /* This thread must be going to sleep. */
  TDQ_LOCK(tdq);
  mtx = thread_lock_block(td);
 -tdq_load_rem(tdq, td);
 +#if defined(SMP)
 +if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
 +#endif
 +tdq_load_rem(tdq, td);
  }
  /*
   * We enter here with the thread blocked and assigned to the
 @@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
  tdq_setlowpri(tdq, NULL);
   }

 +void
 +sched_load_rem(struct thread *td)
 +{
 +struct tdq *tdq;
 +
 +KASSERT(td == curthread,
 +("sched_rem_load: only curthread is supported"));
 +KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
 +("thread running on cpu different from ts_cpu"));
 +tdq = TDQ_CPU(td->td_sched->ts_cpu);
 +TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 +MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 +tdq_load_rem(tdq, td);
 +}
 +
   /*
* Fetch cpu utilization information.  Updates on demand.
*/
 diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
 index 31d16fe..d1d68e9 100644
 --- a/sys/kern/subr_turnstile.c
 +++ b/sys/kern/subr_turnstile.c
 @@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread
 *owner,
 int queue)
  LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
  }
  thread_lock(td);
 +#if defined(SCHED_ULE) && defined(SMP)
 +/*
 + * Remove load earlier so that it does not affect cpu selection
 + * for a thread waken up due to priority lending, if any.
 + */
 +sched_load_rem(td);
 +#endif
  thread_lock_set(td, &ts->ts_lock);
  td->td_turnstile = NULL;

 diff --git a/sys/sys/sched.h b/sys/sys/sched.h
 index 4b8387c..b1ead1b 100644
 --- a/sys/sys/sched.h
 +++ b/sys/sys/sched.h
 @@ -110,6 +110,9 @@ void sched_preempt(struct thread *td);
   void   sched_add(struct thread *td, int flags);
   void   sched_clock(struct thread *td);
   void   sched_rem(struct thread *td);
 +#if defined(SCHED_ULE) && defined(SMP)
 +void	sched_load_rem(struct thread *td);
 +#endif
   void   sched_tick(int cnt);
   void   sched_relinquish(struct thread *td);
   struct thread *sched_choose(void);


 I found another scenario in taskqueue, in the function
 taskqueue_terminate, current thread tries to wake
 another thread up and sleep immediately, the tq_mutex sometimes
 is a spinlock. So if you remove one thread load from current cpu
 before wakeup, the resumed thread may be put on same cpu,
 so it will optimize the cpu 

Re: ule+smp: small optimization for turnstile priority lending

2012-11-06 Thread David Xu

On 2012/11/06 19:03, Attilio Rao wrote:

On 9/20/12, David Xu davi...@freebsd.org wrote:

On 2012/09/18 22:05, Andriy Gapon wrote:


Here is a snippet that demonstrates the issue on a supposedly fully
loaded
2-processor system:

136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid
102818,
state:running, attributes: prio:122

136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
111916,
state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
14,
state:running, attributes: prio:255

136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
counter:0,
attributes: none

136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
113473,
state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
counter:2,
attributes: none

136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
113473,
point:wokeup, attributes: linkedto:Xorg tid 102818

136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid
102818,
state:runq add, attributes: prio:122, linkedto:firefox tid 113473

136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
counter:1,
attributes: none

136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid
102818,
state:runq rem, attributes: prio:176

136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid
102818,
point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
113473

136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
113473,
state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0 where it
preempted
cc1plus thread (I do have preemption enabled) only to leave CPU 1 with
zero load.

Here is a proposed solution:

  turnstile_wait: optimize priority lending to a thread on a runqueue

  As the current thread is definitely going into mi_switch, it now
removes
  its load before doing priority propagation which can potentially
result
  in sched_add.  In the SMP & ULE case the latter searches for the
  least loaded CPU to place a boosted thread, which is supposedly
about
  to run.

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 8e466cd..3299cae 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread
*newtd, int
flags)
/* This thread must be going to sleep. */
TDQ_LOCK(tdq);
mtx = thread_lock_block(td);
-   tdq_load_rem(tdq, td);
+#if defined(SMP)
+   if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
+#endif
+   tdq_load_rem(tdq, td);
}
/*
 * We enter here with the thread blocked and assigned to the
@@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
tdq_setlowpri(tdq, NULL);
   }

+void
+sched_load_rem(struct thread *td)
+{
+   struct tdq *tdq;
+
+   KASSERT(td == curthread,
+   ("sched_rem_load: only curthread is supported"));
+   KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
+   ("thread running on cpu different from ts_cpu"));
+   tdq = TDQ_CPU(td->td_sched->ts_cpu);
+   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+   MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
+   tdq_load_rem(tdq, td);
+}
+
   /*
* Fetch cpu utilization information.  Updates on demand.
*/
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 31d16fe..d1d68e9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread
*owner,
int queue)
LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
}
thread_lock(td);
+#if defined(SCHED_ULE) && defined(SMP)
+   /*
+* Remove load earlier so that it does not affect cpu selection
+* for a thread waken up due to priority lending, if any.
+*/
+   sched_load_rem(td);
+#endif
thread_lock_set(td, &ts->ts_lock);
td->td_turnstile = NULL;

diff --git a/sys/sys/sched.h b/sys/sys/sched.h
index 4b8387c..b1ead1b 100644
--- a/sys/sys/sched.h
+++ b/sys/sys/sched.h
@@ -110,6 +110,9 @@ void	sched_preempt(struct thread *td);
   void sched_add(struct thread *td, int flags);
   void sched_clock(struct thread *td);
   void sched_rem(struct thread *td);
+#if defined(SCHED_ULE) && defined(SMP)
+void   sched_load_rem(struct thread *td);
+#endif
   void sched_tick(int cnt);
   void sched_relinquish(struct thread *td);
   struct thread *sched_choose(void);



I found another scenario in taskqueue, in the function
taskqueue_terminate, current thread tries to wake
another thread up and sleep immediately, the tq_mutex sometimes
is a spinlock. So if you remove one thread load from current cpu
before wakeup, the resumed thread may be put on same cpu,
so it will optimize the cpu scheduling too.


I 

Re: ule+smp: small optimization for turnstile priority lending

2012-11-06 Thread David Xu

On 2012/11/07 14:17, Jeff Roberson wrote:

On Wed, 7 Nov 2012, David Xu wrote:


On 2012/11/06 19:03, Attilio Rao wrote:

On 9/20/12, David Xu davi...@freebsd.org wrote:

On 2012/09/18 22:05, Andriy Gapon wrote:


Here is a snippet that demonstrates the issue on a supposedly fully
loaded
2-processor system:

136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid
102818,
state:running, attributes: prio:122

136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
111916,
state:yielding, attributes: prio:183, wmesg:(null),
lockname:(null)

136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1
tid
14,
state:running, attributes: prio:255

136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
counter:0,
attributes: none

136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
113473,
state:blocked, attributes: prio:122, wmesg:(null),
lockname:unp_mtx

136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
counter:2,
attributes: none

136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
113473,
point:wokeup, attributes: linkedto:Xorg tid 102818

136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid
102818,
state:runq add, attributes: prio:122, linkedto:firefox tid 113473

136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
counter:1,
attributes: none

136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid
102818,
state:runq rem, attributes: prio:176

136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid
102818,
point:prio, attributes: prio:176, new prio:122, linkedto:firefox
tid
113473

136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
113473,
state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0 where it
preempted
cc1plus thread (I do have preemption enabled) only to leave CPU 1 with
zero load.

Here is a proposed solution:

  turnstile_wait: optimize priority lending to a thread on a
runqueue

  As the current thread is definitely going into mi_switch, it now
removes
  its load before doing priority propagation which can potentially
result
  in sched_add.  In the SMP & ULE case the latter searches for
the
  least loaded CPU to place a boosted thread, which is supposedly
about
  to run.

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 8e466cd..3299cae 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread
*newtd, int
flags)
   /* This thread must be going to sleep. */
   TDQ_LOCK(tdq);
   mtx = thread_lock_block(td);
-tdq_load_rem(tdq, td);
+#if defined(SMP)
+if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
+#endif
+tdq_load_rem(tdq, td);
   }
   /*
* We enter here with the thread blocked and assigned to the
@@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
   tdq_setlowpri(tdq, NULL);
   }

+void
+sched_load_rem(struct thread *td)
+{
+struct tdq *tdq;
+
+KASSERT(td == curthread,
+("sched_rem_load: only curthread is supported"));
+KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
+("thread running on cpu different from ts_cpu"));
+tdq = TDQ_CPU(td->td_sched->ts_cpu);
+TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
+tdq_load_rem(tdq, td);
+}
+
   /*
* Fetch cpu utilization information.  Updates on demand.
*/
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 31d16fe..d1d68e9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct
thread
*owner,
int queue)
   LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
   }
   thread_lock(td);
+#if defined(SCHED_ULE) && defined(SMP)
+/*
+ * Remove load earlier so that it does not affect cpu selection
+ * for a thread waken up due to priority lending, if any.
+ */
+sched_load_rem(td);
+#endif
   thread_lock_set(td, &ts->ts_lock);
   td->td_turnstile = NULL;

diff --git a/sys/sys/sched.h b/sys/sys/sched.h
index 4b8387c..b1ead1b 100644
--- a/sys/sys/sched.h
+++ b/sys/sys/sched.h
@@ -110,6 +110,9 @@ void	sched_preempt(struct thread *td);
   void	sched_add(struct thread *td, int flags);
   void	sched_clock(struct thread *td);
   void	sched_rem(struct thread *td);
+#if defined(SCHED_ULE) && defined(SMP)
+void	sched_load_rem(struct thread *td);
+#endif
   void	sched_tick(int cnt);
   void	sched_relinquish(struct thread *td);
   struct thread *sched_choose(void);



I found another scenario in taskqueue, in the function
taskqueue_terminate, current thread tries to wake
another thread up and sleep immediately, the tq_mutex sometimes
is a spinlock. So if you remove one thread load from current cpu
before wakeup, the resumed thread may be put on same cpu,
so it will optimize the cpu scheduling too.


I think 

Re: ule+smp: small optimization for turnstile priority lending

2012-10-29 Thread Attilio Rao
On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon a...@freebsd.org wrote:
 on 20/09/2012 16:14 Attilio Rao said the following:
 On 9/20/12, Andriy Gapon a...@freebsd.org wrote:
 [snip]
 The patch works well as far as I can tell.  Thank you!
 There is one warning with full witness enabled but it appears to be harmless
 (so
 far):

 Andriy,
 thanks a lot for your testing and reports you made so far.
 Unfortunately I'm going off for 2 weeks now and I won't work on
 FreeBSD for that timeframe. I will get back to those in 2 weeks then.
 If you want  to play more with this idea feel free to extend/fix/etc.
 this patch.

 Unfortunately I haven't found time to work on this further, but I have some
 additional thoughts.

 First, I think that the witness message was not benign and it actually warns 
 about
 the same kind of deadlock that I originally had.
 The problem is that sched_lend_prio() is called with target thread's td_lock 
 held,
 which is a lock of tdq on the thread's CPU.  Then, in your patch, we acquire
 current tdq's lock to modify its load.  But if two tdq locks are to be held 
 at the
 same time, then they must be locked using tdq_lock_pair, so that lock order is
 maintained.  With the patch no tdq lock order can be maintained (it can be
 arbitrary)
 and thus it is possible to run into a deadlock.

Indeed. I realized this after thinking about this problem while I was
on holiday.


 I see two possibilities here, but don't rule out that there can be more.

 1. Do something like my patch does.  That is, manipulate current thread's tdq 
 in
 advance before any other thread/tdq locks are acquired (or td_lock is changed 
 to
 point to a different lock and current tdq is unlocked).  The API can be made 
 more
 generic in nature.  E.g. it can look like this:
 void
 sched_thread_block(struct thread *td, int inhibitor)
 {
 struct tdq *tdq;

 THREAD_LOCK_ASSERT(td, MA_OWNED);
 KASSERT(td == curthread,
 ("sched_thread_block: only curthread is supported"));
 tdq = TDQ_SELF();
 TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 TD_SET_INHIB(td, inhibitor);
 tdq_load_rem(tdq, td);
 tdq_setlowpri(tdq, td);
 }


 2. Try to do things from sched_lend_prio based on curthread's state.  This, 
 as it
 seems, would require completely lock-less manipulations of current tdq.  E.g. 
 for
 tdq_load we could use atomic operations (it is already accessed locklessly, 
 but
 not modified).  But for tdq_lowpri a more elaborate trick would be required 
 like
 having a separate field for a temporary value.

No, this is not going to work because tdq_lowpri and tdq_load are
updated and used in the same critical path (and possibly also other
tdq* members in tdq_runqueue_add(), for example, but I didn't bother
to also check that).

Let me think some more about this and get back to you.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: ule+smp: small optimization for turnstile priority lending

2012-10-29 Thread Attilio Rao
On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao atti...@freebsd.org wrote:
 On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon a...@freebsd.org wrote:
 on 20/09/2012 16:14 Attilio Rao said the following:
 On 9/20/12, Andriy Gapon a...@freebsd.org wrote:
 [snip]
 The patch works well as far as I can tell.  Thank you!
 There is one warning with full witness enabled but it appears to be
 harmless
 (so
 far):

 Andriy,
 thanks a lot for your testing and reports you made so far.
 Unfortunately I'm going off for 2 weeks now and I won't work on
 FreeBSD for that timeframe. I will get back to those in 2 weeks then.
 If you want  to play more with this idea feel free to extend/fix/etc.
 this patch.

 Unfortunately I haven't found time to work on this further, but I have some
 additional thoughts.

 First, I think that the witness message was not benign and it actually warns 
 about
 the same kind of deadlock that I originally had.
 The problem is that sched_lend_prio() is called with target thread's td_lock 
 held,
 which is a lock of tdq on the thread's CPU.  Then, in your patch, we acquire
 current tdq's lock to modify its load.  But if two tdq locks are to be held 
 at the
 same time, then they must be locked using tdq_lock_pair, so that lock order 
 is
 maintained.  With the patch no tdq lock order can be maintained (it can be
 arbitrary)
 and thus it is possible to run into a deadlock.

 Indeed. I realized this after thinking about this problem while I was
 on holiday.


 I see two possibilities here, but don't rule out that there can be more.

 1. Do something like my patch does.  That is, manipulate current thread's 
 tdq in
 advance before any other thread/tdq locks are acquired (or td_lock is 
 changed to
 point to a different lock and current tdq is unlocked).  The API can be made 
 more
 generic in nature.  E.g. it can look like this:
 void
 sched_thread_block(struct thread *td, int inhibitor)
 {
 struct tdq *tdq;

 THREAD_LOCK_ASSERT(td, MA_OWNED);
 KASSERT(td == curthread,
 ("sched_thread_block: only curthread is supported"));
 tdq = TDQ_SELF();
 TDQ_LOCK_ASSERT(tdq, MA_OWNED);
 MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
 TD_SET_INHIB(td, inhibitor);
 tdq_load_rem(tdq, td);
 tdq_setlowpri(tdq, td);
 }


 2. Try to do things from sched_lend_prio based on curthread's state.  This, 
 as it
 seems, would require completely lock-less manipulations of current tdq.  
 E.g. for
 tdq_load we could use atomic operations (it is already accessed locklessly, 
 but
 not modified).  But for tdq_lowpri a more elaborate trick would be required 
 like
 having a separate field for a temporary value.

 No, this is not going to work because tdq_lowpri and tdq_load are
 updated and used in the same critical path (and possibly also other
 tdq* members in tdq_runqueue_add(), for example, but I didn't bother
 to also check that).

Ok, so here I wasn't completely right, because tdq_lowpri and tdq_load
are not read in the same critical path (cpu_search() and
sched_pickcpu() in the self case).
However, I was over-estimating a factor in this patch: I don't think
we need to fix the real tdq_load for self in that case before cpu_search(),
because self is handled in a very special way, with its own
comparison. I then think that a patch local to sched_pickcpu() on the
self path should be enough.

This brings us to another, bigger issue, however: tdq_lowpri handling
is not perfect in that patch. Think about the case where the thread to
be switched out is, in fact, the lowest-priority one. In that case,
what we should do is recalculate tdq_lowpri,
which is usually a very expensive operation and requires TDQ self
locking to be in place to be done correctly.
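
For reference, the recalculation in question is roughly tdq_setlowpri() from
sys/kern/sched_ule.c (paraphrased from memory, so details may differ); the cost
comes from having to pick the next-best thread off the runqueues, and it must run
with the tdq lock held:

static void
tdq_setlowpri(struct tdq *tdq, struct thread *ctd)
{
	struct thread *td;

	TDQ_LOCK_ASSERT(tdq, MA_OWNED);		/* needs the tdq self lock */
	if (ctd == NULL)
		ctd = pcpu_find(TDQ_ID(tdq))->pc_curthread;
	td = tdq_choose(tdq);			/* scan the runqueues */
	if (td == NULL || td->td_priority > ctd->td_priority)
		tdq->tdq_lowpri = ctd->td_priority;
	else
		tdq->tdq_lowpri = td->td_priority;
}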

At this point, I would suggest to use the new version of avg_ule.patch
which doesn't take into account tdq_lowpri. This means the
optimization won't work in all the cases but at least it is a good
trade-off with a more hacky version.

This also makes me wonder about the tdq_lowpri check in the self case,
in general. Basically it forces sched_pickcpu() to select self if and
only if the new thread to schedule has a higher priority than the
lowest one on curcpu. Why is it like this? Is this check specifically
used to enforce some sort of fairness?
It would be good if Jeff could spend a word or two on this check.

Anyway the patch that implements what suggested, let me know your
thinking about it:
http://www.freebsd.org/~attilio/avg_ule2.patch

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: ule+smp: small optimization for turnstile priority lending

2012-10-29 Thread Attilio Rao
On Mon, Oct 29, 2012 at 6:59 PM, Attilio Rao atti...@freebsd.org wrote:

[ trimm ]

 This also makes me wonder about the tdq_lowpri check in the self case,
 in general. Basically it forces sched_pickcpu() to select self if and
 only if the new thread to schedule has a higher priority than the
 lowest one on curcpu. Why is it like this? Is this check specifically
 used to enforce some sort of fairness?
 It would be good if Jeff could spend a word or two on this check.

Also, I've read the code of tdq_setlowpri() more closely and I think
the name tdq_lowpri is highly misleading, because it seems to me that it
tracks the *highest*-priority thread.
That said, this means that self will be selected in sched_pickcpu() if
and only if the new thread has a higher priority than all the ones on
the self runqueue. Isn't all this a bit too extreme? Assuming that one
cpu has only a single high-priority thread and the others are very loaded,
it would likely make sense not to keep loading them but to switch to the
self one, maybe.
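
The check being discussed is the final self comparison in sched_pickcpu(), roughly
of this shape (paraphrased, not an exact quote of sys/kern/sched_ule.c):

	/*
	 * Paraphrased: prefer the current cpu only if the new thread beats
	 * the best priority already queued there (tdq_lowpri) and self is
	 * not noticeably more loaded than the cpu picked by cpu_search().
	 */
	if (THREAD_CAN_SCHED(td, self) &&
	    TDQ_CPU(self)->tdq_lowpri > pri &&
	    TDQ_CPU(self)->tdq_load <= TDQ_CPU(cpu)->tdq_load + 1)
		cpu = self;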

 Anyway the patch that implements what suggested, let me know your
 thinking about it:
 http://www.freebsd.org/~attilio/avg_ule2.patch

I was thinking, however, that maybe we could do the tdq_choose()
calculation if self == target, to have a little better chance of getting
the optimization in place, eventually.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: ule+smp: small optimization for turnstile priority lending

2012-10-03 Thread Andriy Gapon
on 20/09/2012 16:14 Attilio Rao said the following:
 On 9/20/12, Andriy Gapon a...@freebsd.org wrote:
[snip]
 The patch works well as far as I can tell.  Thank you!
 There is one warning with full witness enabled but it appears to be harmless
 (so
 far):
 
 Andriy,
 thanks a lot for your testing and reports you made so far.
 Unfortunately I'm going off for 2 weeks now and I won't work on
 FreeBSD for that timeframe. I will get back to those in 2 weeks then.
 If you want  to play more with this idea feel free to extend/fix/etc.
 this patch.

Unfortunately I haven't found time to work on this further, but I have some
additional thoughts.

First, I think that the witness message was not benign and it actually warns 
about
the same kind of deadlock that I originally had.
The problem is that sched_lend_prio() is called with target thread's td_lock 
held,
which is a lock of tdq on the thread's CPU.  Then, in your patch, we acquire
current tdq's lock to modify its load.  But if two tdq locks are to be held at 
the
same time, then they must be locked using tdq_lock_pair, so that lock order is
maintained.  With the patch no tdq lock order can be maintained (it can be arbitrary)
and thus it is possible to run into a deadlock.
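
For context, the ordered path referred to here is tdq_lock_pair() in
sys/kern/sched_ule.c, roughly as follows (paraphrased, details may differ):

/*
 * Paraphrased: two tdq locks are always taken in a fixed (address) order,
 * which a bare TDQ_LOCK() issued from the priority-lending path cannot
 * guarantee, hence the possible deadlock described above.
 */
static void
tdq_lock_pair(struct tdq *one, struct tdq *two)
{
	if (one < two) {
		TDQ_LOCK(one);
		TDQ_LOCK_FLAGS(two, MTX_DUPOK);
	} else {
		TDQ_LOCK(two);
		TDQ_LOCK_FLAGS(one, MTX_DUPOK);
	}
}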

I see two possibilities here, but don't rule out that there can be more.

1. Do something like my patch does.  That is, manipulate current thread's tdq in
advance before any other thread/tdq locks are acquired (or td_lock is changed to
point to a different lock and current tdq is unlocked).  The API can be made 
more
generic in nature.  E.g. it can look like this:
void
sched_thread_block(struct thread *td, int inhibitor)
{
struct tdq *tdq;

THREAD_LOCK_ASSERT(td, MA_OWNED);
KASSERT(td == curthread,
("sched_thread_block: only curthread is supported"));
tdq = TDQ_SELF();
TDQ_LOCK_ASSERT(tdq, MA_OWNED);
MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
TD_SET_INHIB(td, inhibitor);
tdq_load_rem(tdq, td);
tdq_setlowpri(tdq, td);
}


2. Try to do things from sched_lend_prio based on curthread's state.  This, as 
it
seems, would require completely lock-less manipulations of current tdq.  E.g. 
for
tdq_load we could use atomic operations (it is already accessed locklessly, but
not modified).  But for tdq_lowpri a more elaborate trick would be required like
having a separate field for a temporary value.
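
A minimal sketch of what option 2's lock-less tdq_load adjustment might look like
(illustrative only; tdq_load is a plain int today, hence the cast, and the tdq_lowpri
side is deliberately left out as discussed above):

	/*
	 * Illustrative only: drop curthread's load from its tdq without
	 * taking the tdq lock before propagating priority, and restore it
	 * afterwards if the switch does not happen after all.
	 */
	atomic_subtract_int((volatile u_int *)&tdq->tdq_load, 1);
	sched_lend_prio(owner, td->td_priority);
	/* ... */
	atomic_add_int((volatile u_int *)&tdq->tdq_load, 1);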

In any case, I'll have to revisit this later.

-- 
Andriy Gapon


Re: ule+smp: small optimization for turnstile priority lending

2012-09-20 Thread David Xu

On 2012/09/18 22:05, Andriy Gapon wrote:


Here is a snippet that demonstrates the issue on a supposedly fully loaded
2-processor system:

136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
state:running, attributes: prio:122

136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916,
state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14,
state:running, attributes: prio:255

136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0,
attributes: none

136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473,
state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2,
attributes: none

136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473,
point:wokeup, attributes: linkedto:Xorg tid 102818

136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
state:runq add, attributes: prio:122, linkedto:firefox tid 113473

136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1,
attributes: none

136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
state:runq rem, attributes: prio:176

136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473

136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473,
state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0 where it preempted
cc1plus thread (I do have preemption enabled) only to leave CPU 1 with zero 
load.

Here is a proposed solution:

 turnstile_wait: optimize priority lending to a thread on a runqueue

 As the current thread is definitely going into mi_switch, it now removes
 its load before doing priority propagation which can potentially result
 in sched_add.  In the SMP & ULE case the latter searches for the
 least loaded CPU to place a boosted thread, which is supposedly about
 to run.

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 8e466cd..3299cae 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread *newtd, int
flags)
/* This thread must be going to sleep. */
TDQ_LOCK(tdq);
mtx = thread_lock_block(td);
-   tdq_load_rem(tdq, td);
+#if defined(SMP)
+   if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
+#endif
+   tdq_load_rem(tdq, td);
}
/*
 * We enter here with the thread blocked and assigned to the
@@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
tdq_setlowpri(tdq, NULL);
  }

+void
+sched_load_rem(struct thread *td)
+{
+   struct tdq *tdq;
+
+   KASSERT(td == curthread,
+   ("sched_rem_load: only curthread is supported"));
+   KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
+   ("thread running on cpu different from ts_cpu"));
+   tdq = TDQ_CPU(td->td_sched->ts_cpu);
+   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+   MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
+   tdq_load_rem(tdq, td);
+}
+
  /*
   * Fetch cpu utilization information.  Updates on demand.
   */
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 31d16fe..d1d68e9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread *owner,
int queue)
LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
}
thread_lock(td);
+#if defined(SCHED_ULE) && defined(SMP)
+   /*
+* Remove load earlier so that it does not affect cpu selection
+* for a thread waken up due to priority lending, if any.
+*/
+   sched_load_rem(td);
+#endif
thread_lock_set(td, &ts->ts_lock);
td->td_turnstile = NULL;

diff --git a/sys/sys/sched.h b/sys/sys/sched.h
index 4b8387c..b1ead1b 100644
--- a/sys/sys/sched.h
+++ b/sys/sys/sched.h
@@ -110,6 +110,9 @@ void	sched_preempt(struct thread *td);
  void	sched_add(struct thread *td, int flags);
  void	sched_clock(struct thread *td);
  void	sched_rem(struct thread *td);
+#if defined(SCHED_ULE) && defined(SMP)
+void   sched_load_rem(struct thread *td);
+#endif
  void  sched_tick(int cnt);
  void  sched_relinquish(struct thread *td);
  struct thread *sched_choose(void);



I found another scenario in taskqueue, in the function
taskqueue_terminate, current thread tries to wake
another thread up and sleep immediately, the tq_mutex sometimes
is a spinlock. So if you remove one thread load from current cpu
before wakeup, the resumed thread may be put on same cpu,
so it will optimize the cpu scheduling too.

/*
 * Signal a taskqueue thread to terminate.
 */
static void
taskqueue_terminate(struct thread **pp, 

Re: ule+smp: small optimization for turnstile priority lending

2012-09-20 Thread Andriy Gapon
on 19/09/2012 10:33 Attilio Rao said the following:
 It is hard for me to tell if this is subject to same issues because I
 do not entirely understand where the locking was happening in your
 patch.
 Can you try testing this with your already KTR instrumented test and
 possibly report?

The patch works well as far as I can tell.  Thank you!
There is one warning with full witness enabled but it appears to be harmless (so
far):

Acquiring duplicate lock of same type: sched lock
 1st sched lock 1 @ /usr/src/sys/kern/subr_turnstile.c:212
 2nd sched lock 0 @ /usr/src/sys/kern/sched_ule.c:1244
KDB: stack backtrace:
db_trace_self_wrapper() at 0x802d238a = db_trace_self_wrapper+0x2a
kdb_backtrace() at 0x80555f4a = kdb_backtrace+0x3a
_witness_debugger() at 0x8056e2bc = _witness_debugger+0x2c
witness_checkorder() at 0x8056f759 = witness_checkorder+0x579
_mtx_lock_spin_flags() at 0x80504bcd = _mtx_lock_spin_flags+0x10d
sched_pickcpu() at 0x80547829 = sched_pickcpu+0x199
sched_add() at 0x80547bdb = sched_add+0x14b
sched_thread_priority() at 0x80548059 = sched_thread_priority+0x1c9
sched_lend_prio() at 0x80548344 = sched_lend_prio+0x14
propagate_priority() at 0x8056801e = propagate_priority+0x1ce
turnstile_wait() at 0x8056959f = turnstile_wait+0x4ef
_mtx_lock_sleep() at 0x805045f6 = _mtx_lock_sleep+0x486
_mtx_lock_flags() at 0x80504814 = _mtx_lock_flags+0x104
lock_mtx() at 0x805049ca = lock_mtx+0x1a
_sleep() at 0x80524589 = _sleep+0x4f9
taskqueue_thread_loop() at 0x805664c8 = taskqueue_thread_loop+0xa8
fork_exit() at 0x804e5d6a = fork_exit+0x1aa
fork_trampoline() at 0x806ea2ce = fork_trampoline+0xe


-- 
Andriy Gapon


Re: ule+smp: small optimization for turnstile priority lending

2012-09-20 Thread Andriy Gapon
on 20/09/2012 14:04 Andriy Gapon said the following:
 on 19/09/2012 10:33 Attilio Rao said the following:
 It is hard for me to tell if this is subject to same issues because I
 do not entirely understand where the locking was happening in your
 patch.
 Can you try testing this with your already KTR instrumented test and
 possibly report?
 
 The patch works well as far as I can tell.  Thank you!

After more testing it seems that the idea was not complete.  While loads are now
set better, tdq_lowpri (which is generally <= newpri) on the current CPU still
prevents the boosted thread from taking advantage of the lowered load.
Not sure how to refresh tdq_lowpri in this scenario...

 There is one warning with full witness enabled but it appears to be harmless
 (so
 far):
 
 Acquiring duplicate lock of same type: sched lock
  1st sched lock 1 @ /usr/src/sys/kern/subr_turnstile.c:212
  2nd sched lock 0 @ /usr/src/sys/kern/sched_ule.c:1244
 KDB: stack backtrace:
 db_trace_self_wrapper() at 0x802d238a = db_trace_self_wrapper+0x2a
 kdb_backtrace() at 0x80555f4a = kdb_backtrace+0x3a
 _witness_debugger() at 0x8056e2bc = _witness_debugger+0x2c
 witness_checkorder() at 0x8056f759 = witness_checkorder+0x579
 _mtx_lock_spin_flags() at 0x80504bcd = _mtx_lock_spin_flags+0x10d
 sched_pickcpu() at 0x80547829 = sched_pickcpu+0x199
 sched_add() at 0x80547bdb = sched_add+0x14b
 sched_thread_priority() at 0x80548059 = sched_thread_priority+0x1c9
 sched_lend_prio() at 0x80548344 = sched_lend_prio+0x14
 propagate_priority() at 0x8056801e = propagate_priority+0x1ce
 turnstile_wait() at 0x8056959f = turnstile_wait+0x4ef
 _mtx_lock_sleep() at 0x805045f6 = _mtx_lock_sleep+0x486
 _mtx_lock_flags() at 0x80504814 = _mtx_lock_flags+0x104
 lock_mtx() at 0x805049ca = lock_mtx+0x1a
 _sleep() at 0x80524589 = _sleep+0x4f9
 taskqueue_thread_loop() at 0x805664c8 = taskqueue_thread_loop+0xa8
 fork_exit() at 0x804e5d6a = fork_exit+0x1aa
 fork_trampoline() at 0x806ea2ce = fork_trampoline+0xe
 
 


-- 
Andriy Gapon




Re: ule+smp: small optimization for turnstile priority lending

2012-09-20 Thread Attilio Rao
On 9/20/12, Andriy Gapon a...@freebsd.org wrote:
 on 19/09/2012 10:33 Attilio Rao said the following:
 It is hard for me to tell if this is subject to same issues because I
 do not entirely understand where the locking was happening in your
 patch.
 Can you try testing this with your already KTR instrumented test and
 possibly report?

 The patch works well as far as I can tell.  Thank you!
 There is one warning with full witness enabled but it appears to be harmless
 (so
 far):

Andriy,
thanks a lot for your testing and reports you made so far.
Unfortunately I'm going off for 2 weeks now and I won't work on
FreeBSD for that timeframe. I will get back to those in 2 weeks then.
If you want  to play more with this idea feel free to extend/fix/etc.
this patch.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: ule+smp: small optimization for turnstile priority lending

2012-09-19 Thread Andriy Gapon
on 19/09/2012 02:01 Attilio Rao said the following:
 On Tue, Sep 18, 2012 at 8:07 PM, Andriy Gapon a...@freebsd.org wrote:
 on 18/09/2012 19:50 Attilio Rao said the following:
 On 9/18/12, Andriy Gapon a...@freebsd.org wrote:

 Here is a snippet that demonstrates the issue on a supposedly fully loaded
 2-processor system:

 136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:running, attributes: prio:122

 136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
 111916,
 state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

 136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
 14,
 state:running, attributes: prio:255

 136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
 counter:0,
 attributes: none

 136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

 136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
 counter:2,
 attributes: none

 136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
 113473,
 point:wokeup, attributes: linkedto:Xorg tid 102818

 136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq add, attributes: prio:122, linkedto:firefox tid 113473

 136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
 counter:1,
 attributes: none

 136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq rem, attributes: prio:176

 136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
 point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
 113473

 136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:running, attributes: prio:104

 See how the Xorg thread was forced from CPU 1 to CPU 0 where it
 preempted
 cc1plus thread (I do have preemption enabled) only to leave CPU 1 with zero
 load.

 I think that the idea is bright, but I have reservations against the
 implementation because it seems to me there are too many layering
 violations.

 Just one - for a layer between turnstile and scheduler :-)
 But I agree.

 What I suggest is somewhat summarized like this:
 - Add a new SRQ_WILLSLEEP or the name you prefer
 - Add a new flags argument to sched_lend_prio() (both ule and 4bsd)
 and sched_thread_priority (ule only)
 - sched_thread_priority() will pass down the new flag to sched_add()
 which passes it down to sched_pickcpu().

 This way sched_pickcpu() has the correct knowledge of what is going on
 and it can make the right decision. You likely don't need to lower the
 tdq_load at that time either this way, because sched_pickcpu() can
 just adjust it locally for its decision.

 What do you think?

 This sounds easy but it is not quite so given the implementation of
 sched_pickcpu and sched_lowest.  This is probably more work than I am able to
 take now.
 
 I think actually that the attached patch should do what you need. Of
 course there is more runqueue locking, due to the tdq_load fondling,
 but I'm not sure it is really a big deal.
 I didn't test it yet, as I understand you have a test case, so maybe
 you can. However if Jeff agrees I can send the patch to flo@ for
 performance evaluation.
 
 Thoughts?

Originally I had a similar patch with a small difference that I did tdq_load
manipulations in sched_thread_priority around sched_add call.  That patch
produced a deadlock because of the need for TDQ_LOCK.

 Index: sys/sys/sched.h
 ===
 --- sys/sys/sched.h (revisione 240672)
 +++ sys/sys/sched.h (copia locale)
 @@ -91,7 +91,7 @@ void  sched_nice(struct proc *p, int nice);
   */
  void   sched_exit_thread(struct thread *td, struct thread *child);
  void   sched_fork_thread(struct thread *td, struct thread *child);
 -void   sched_lend_prio(struct thread *td, u_char prio);
 +void   sched_lend_prio(struct thread *td, u_char prio, int flags);
  void   sched_lend_user_prio(struct thread *td, u_char pri);
  fixpt_t	sched_pctcpu(struct thread *td);
  void	sched_prio(struct thread *td, u_char prio);
 @@ -161,6 +161,7 @@ sched_unpin(void)
  #define	SRQ_INTR	0x0004	/* It is probably urgent. */
  #define	SRQ_PREEMPTED	0x0008	/* has been preempted.. be kind */
  #define	SRQ_BORROWING	0x0010	/* Priority updated due to prio_lend */
 +#define	SRQ_WILLSWITCH	0x0020	/* curthread will be switched out */
 
  /* Scheduler stats. */
  #ifdef SCHED_STATS
 Index: sys/kern/sched_ule.c
 ===
 --- sys/kern/sched_ule.c(revisione 240672)
 +++ sys/kern/sched_ule.c(copia locale)
 @@ -290,7 +290,7 @@ static struct tdq   tdq_cpu;
  #defineTDQ_LOCKPTR(t)  ((t)-tdq_lock)
 
  static void sched_priority(struct thread *);
 -static void sched_thread_priority(struct thread *, 

Re: ule+smp: small optimization for turnstile priority lending

2012-09-19 Thread Attilio Rao
On Wed, Sep 19, 2012 at 7:03 AM, Andriy Gapon a...@freebsd.org wrote:
 on 19/09/2012 02:01 Attilio Rao said the following:
 On Tue, Sep 18, 2012 at 8:07 PM, Andriy Gapon a...@freebsd.org wrote:
 on 18/09/2012 19:50 Attilio Rao said the following:
 On 9/18/12, Andriy Gapon a...@freebsd.org wrote:

 Here is a snippet that demonstrates the issue on a supposedly fully loaded
 2-processor system:

 136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:running, attributes: prio:122

 136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
 111916,
 state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

 136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
 14,
 state:running, attributes: prio:255

 136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
 counter:0,
 attributes: none

 136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

 136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
 counter:2,
 attributes: none

 136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
 113473,
 point:wokeup, attributes: linkedto:Xorg tid 102818

 136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq add, attributes: prio:122, linkedto:firefox tid 113473

 136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
 counter:1,
 attributes: none

 136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq rem, attributes: prio:176

 136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
 point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
 113473

 136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:running, attributes: prio:104

 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted
 the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with
 zero load.

 I think that the idea is bright, but I have reservations against the
 implementation because it seems to me there are too many layering
 violations.

 Just one - for a layer between turnstile and scheduler :-)
 But I agree.

 What I suggest is roughly summarized as follows:
 - Add a new SRQ_WILLSLEEP flag, or whatever name you prefer.
 - Add a new flags argument to sched_lend_prio() (both ULE and 4BSD)
 and to sched_thread_priority() (ULE only).
 - sched_thread_priority() will pass the new flag down to sched_add(),
 which passes it down to sched_pickcpu().

 This way sched_pickcpu() has the correct knowledge of what is going on
 and can make the right decision. You likely don't need to lower
 tdq_load at that time either, because sched_pickcpu() can
 just adjust it locally for its decision.

 What do you think?

 This sounds easy, but it is not quite so given the implementation of
 sched_pickcpu and sched_lowest.  This is probably more work than I am
 able to take on now.

 I think actually that the attached patch should do what you need. Of
 course there is more runqueue locking, due to the tdq_load fondling,
 but I'm not sure it is really a big deal.
 I didn't test it yet, as I understand you have a test case, so maybe
 you can. However if Jeff agrees I can send the patch to flo@ for
 performance evaluation.

 Thoughts?

 Originally I had a similar patch, with the small difference that I did the
 tdq_load manipulations in sched_thread_priority around the sched_add call.
 That patch produced a deadlock because of the need for TDQ_LOCK.

It is hard for me to tell whether this is subject to the same issues,
because I do not entirely understand where the locking was happening in
your patch.
Can you try testing this with your already KTR-instrumented test and
report back, if possible?

Thanks,
Attilio



-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


ule+smp: small optimization for turnstile priority lending

2012-09-18 Thread Andriy Gapon

Here is a snippet that demonstrates the issue on a supposedly fully loaded
2-processor system:

136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
state:running, attributes: prio:122

136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid 111916,
state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid 14,
state:running, attributes: prio:255

136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load, counter:0,
attributes: none

136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid 113473,
state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load, counter:2,
attributes: none

136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid 113473,
point:wokeup, attributes: linkedto:Xorg tid 102818

136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
state:runq add, attributes: prio:122, linkedto:firefox tid 113473

136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load, counter:1,
attributes: none

136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
state:runq rem, attributes: prio:176

136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid 113473

136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid 113473,
state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted
the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with
zero load.
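
To make the effect concrete, here is a small user-space model of the CPU
choice (it is not ULE's sched_pickcpu(); the selection rule, the types and
the will-switch hint are simplified for illustration, and the loads and
priorities come from the trace above).  The model prefers a CPU where the
newly boosted thread can run immediately and breaks ties on load; the hint
tells the picker that the waking thread on its own CPU is about to switch out.

#include <stdio.h>

struct cpu {
	int curprio;	/* priority of the currently running thread */
	int load;	/* runnable threads accounted to this CPU */
};

/*
 * Toy picker: prefer a CPU where 'prio' beats the running thread (lower
 * value = higher priority), break ties on load.  If willswitch_cpu >= 0,
 * the thread running there is known to be about to switch out.
 */
static int
pickcpu(const struct cpu *c, int ncpu, int prio, int willswitch_cpu)
{
	int best = -1, best_can_run = 0, best_load = 0;

	for (int i = 0; i < ncpu; i++) {
		int curprio = c[i].curprio;
		int load = c[i].load;
		int can_run;

		if (i == willswitch_cpu) {
			curprio = 255;	/* about to go idle */
			load--;		/* its load is about to vanish */
		}
		can_run = (prio < curprio);
		if (best == -1 || can_run > best_can_run ||
		    (can_run == best_can_run && load < best_load)) {
			best = i;
			best_can_run = can_run;
			best_load = load;
		}
	}
	return (best);
}

int
main(void)
{
	/*
	 * Snapshot from the trace: CPU 0 runs cc1plus (prio 183) with load 2,
	 * CPU 1 runs firefox (prio 104) with load 1.  firefox wakes Xorg,
	 * lends it prio 122 and is itself about to block on unp_mtx.
	 */
	struct cpu cpus[2] = { { 183, 2 }, { 104, 1 } };

	printf("no hint:   boosted thread goes to CPU %d\n",
	    pickcpu(cpus, 2, 122, -1));
	printf("with hint: boosted thread goes to CPU %d\n",
	    pickcpu(cpus, 2, 122, 1));
	return (0);
}

Without the hint the model reproduces the migration to CPU 0; with the hint
the boosted thread stays on CPU 1, which is about to become idle anyway.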

Here is a proposed solution:

turnstile_wait: optimize priority lending to a thread on a runqueue

As the current thread is definitely going into mi_switch, it now removes
its load before doing priority propagation, which can potentially result
in sched_add.  In the SMP & ULE case the latter searches for the
least loaded CPU to place a boosted thread, which is supposedly about
to run.

diff --git a/sys/kern/sched_ule.c b/sys/kern/sched_ule.c
index 8e466cd..3299cae 100644
--- a/sys/kern/sched_ule.c
+++ b/sys/kern/sched_ule.c
@@ -1878,7 +1878,10 @@ sched_switch(struct thread *td, struct thread *newtd, int flags)
/* This thread must be going to sleep. */
TDQ_LOCK(tdq);
mtx = thread_lock_block(td);
-   tdq_load_rem(tdq, td);
+#if defined(SMP)
+   if ((flags & SW_TYPE_MASK) != SWT_TURNSTILE)
+#endif
+   tdq_load_rem(tdq, td);
}
/*
 * We enter here with the thread blocked and assigned to the
@@ -2412,6 +2415,21 @@ sched_rem(struct thread *td)
tdq_setlowpri(tdq, NULL);
 }

+void
+sched_load_rem(struct thread *td)
+{
+   struct tdq *tdq;
+
+   KASSERT(td == curthread,
+   ("sched_rem_load: only curthread is supported"));
+   KASSERT(td->td_oncpu == td->td_sched->ts_cpu,
+   ("thread running on cpu different from ts_cpu"));
+   tdq = TDQ_CPU(td->td_sched->ts_cpu);
+   TDQ_LOCK_ASSERT(tdq, MA_OWNED);
+   MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
+   tdq_load_rem(tdq, td);
+}
+
 /*
  * Fetch cpu utilization information.  Updates on demand.
  */
diff --git a/sys/kern/subr_turnstile.c b/sys/kern/subr_turnstile.c
index 31d16fe..d1d68e9 100644
--- a/sys/kern/subr_turnstile.c
+++ b/sys/kern/subr_turnstile.c
@@ -731,6 +731,13 @@ turnstile_wait(struct turnstile *ts, struct thread *owner, int queue)
LIST_INSERT_HEAD(&ts->ts_free, td->td_turnstile, ts_hash);
}
thread_lock(td);
+#if defined(SCHED_ULE) && defined(SMP)
+   /*
+    * Remove load earlier so that it does not affect cpu selection
+    * for a thread woken up due to priority lending, if any.
+    */
+   sched_load_rem(td);
+#endif
thread_lock_set(td, &ts->ts_lock);
td->td_turnstile = NULL;

diff --git a/sys/sys/sched.h b/sys/sys/sched.h
index 4b8387c..b1ead1b 100644
--- a/sys/sys/sched.h
+++ b/sys/sys/sched.h
@@ -110,6 +110,9 @@ void   sched_preempt(struct thread *td);
 void   sched_add(struct thread *td, int flags);
 void   sched_clock(struct thread *td);
 void   sched_rem(struct thread *td);
+#if defined(SCHED_ULE) && defined(SMP)
+void   sched_load_rem(struct thread *td);
+#endif
 void   sched_tick(int cnt);
 void   sched_relinquish(struct thread *td);
 struct thread *sched_choose(void);
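
One consequence worth spelling out (my reading of the patch, not the
author's wording): because turnstile_wait() now drops the current thread's
load early via sched_load_rem(), the (flags & SW_TYPE_MASK) != SWT_TURNSTILE
test in sched_switch() is what prevents the same load from being subtracted
a second time.  A stand-alone sketch of that bookkeeping invariant, with
invented names and a plain counter standing in for the real tdq_load:

#include <assert.h>
#include <stdio.h>

#define	SWT_TURNSTILE	1
#define	SWT_SLEEP	2

static int load = 2;		/* runnable threads accounted to this CPU */
static int removed_early;	/* set by the early turnstile path */

/* Models the early removal added to turnstile_wait(). */
static void
turnstile_load_rem(void)
{
	load--;
	removed_early = 1;
}

/* Models sched_switch(): skip the removal the turnstile path already did. */
static void
switch_out(int swt)
{
	if (swt == SWT_TURNSTILE)
		assert(removed_early);	/* the early path must have run */
	else
		load--;
	removed_early = 0;
	assert(load >= 0);		/* no double accounting */
}

int
main(void)
{
	turnstile_load_rem();		/* blocking on a turnstile */
	switch_out(SWT_TURNSTILE);
	printf("load after turnstile block: %d\n", load);	/* 1 */

	switch_out(SWT_SLEEP);		/* an ordinary sleep */
	printf("load after sleep: %d\n", load);			/* 0 */
	return (0);
}

Either path ends with the load removed exactly once.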

-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: ule+smp: small optimization for turnstile priority lending

2012-09-18 Thread Attilio Rao
On 9/18/12, Andriy Gapon a...@freebsd.org wrote:

 Here is a snippet that demonstrates the issue on a supposedly fully loaded
 2-processor system:

 136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:running, attributes: prio:122

 136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
 111916,
 state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

 136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
 14,
 state:running, attributes: prio:255

 136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
 counter:0,
 attributes: none

 136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

 136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
 counter:2,
 attributes: none

 136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
 113473,
 point:wokeup, attributes: linkedto:Xorg tid 102818

 136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq add, attributes: prio:122, linkedto:firefox tid 113473

 136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
 counter:1,
 attributes: none

 136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq rem, attributes: prio:176

 136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
 point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
 113473

 136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:running, attributes: prio:104

 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted
 the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with
 zero load.

I think that the idea is bright, but I have reservations against the
implementation because it seems to me there are too many layering
violations.

What I suggest is roughly summarized as follows:
- Add a new SRQ_WILLSLEEP flag, or whatever name you prefer.
- Add a new flags argument to sched_lend_prio() (both ULE and 4BSD)
and to sched_thread_priority() (ULE only).
- sched_thread_priority() will pass the new flag down to sched_add(),
which passes it down to sched_pickcpu().

This way sched_pickcpu() has the correct knowledge of what is going on
and can make the right decision. You likely don't need to lower
tdq_load at that time either, because sched_pickcpu() can
just adjust it locally for its decision.
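
Schematically, the plumbing described above might look like the following
user-space mock (the names, the stub types and the trivial least-loaded rule
are invented for illustration; the real prototypes are the ones in the patch
posted later in this thread, where the flag ends up being spelled
SRQ_WILLSWITCH):

#include <stdio.h>

#define	SRQ_BORROWING	0x0010
#define	SRQ_WILLSLEEP	0x0020	/* the caller will switch out right away */

typedef unsigned char u_char;

struct thread {
	u_char	td_priority;
	int	td_cpu;
};

static int cpu_load[2] = { 1, 1 };	/* made-up loads for CPU 0 and CPU 1 */
static int curcpu = 1;			/* the lending thread runs on CPU 1 */

/* Least-loaded pick, discounting the caller's load when it will sleep. */
static int
sched_pickcpu(struct thread *td, int flags)
{
	int best = -1, bestload = 0;

	(void)td;
	for (int i = 0; i < 2; i++) {
		int load = cpu_load[i];

		if (i == curcpu && (flags & SRQ_WILLSLEEP))
			load--;		/* that load is about to vanish */
		if (best == -1 || load < bestload) {
			best = i;
			bestload = load;
		}
	}
	return (best);
}

static void
sched_add(struct thread *td, int flags)
{
	td->td_cpu = sched_pickcpu(td, flags);
}

static void
sched_lend_prio(struct thread *td, u_char prio, int flags)
{
	td->td_priority = prio;
	/* If td sits on a run queue it is re-added here; the flag rides along. */
	sched_add(td, flags | SRQ_BORROWING);
}

int
main(void)
{
	struct thread owner = { .td_priority = 176, .td_cpu = 1 };

	/*
	 * turnstile_wait(): boost the lock owner and tell the scheduler that
	 * the current thread is about to block.
	 */
	sched_lend_prio(&owner, 122, SRQ_WILLSLEEP);
	printf("boosted owner placed on CPU %d\n", owner.td_cpu);
	return (0);
}

The point is only that the flag travels from the turnstile code through
sched_lend_prio() and sched_add() down to the CPU picker, which can then
discount the caller's soon-to-disappear load without any early tdq_load
manipulation.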

What do you think?

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: ule+smp: small optimization for turnstile priority lending

2012-09-18 Thread Andriy Gapon
on 18/09/2012 19:50 Attilio Rao said the following:
 On 9/18/12, Andriy Gapon a...@freebsd.org wrote:

 Here is a snippet that demonstrates the issue on a supposedly fully loaded
 2-processor system:

 136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:running, attributes: prio:122

 136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
 111916,
 state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

 136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
 14,
 state:running, attributes: prio:255

 136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
 counter:0,
 attributes: none

 136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

 136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
 counter:2,
 attributes: none

 136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
 113473,
 point:wokeup, attributes: linkedto:Xorg tid 102818

 136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq add, attributes: prio:122, linkedto:firefox tid 113473

 136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
 counter:1,
 attributes: none

 136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq rem, attributes: prio:176

 136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
 point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
 113473

 136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:running, attributes: prio:104

 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted
 the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with
 zero load.
 
 I think that the idea is bright, but I have reservations against the
 implementation because it seems to me there are too many layering
 violations.

Just one - for a layer between turnstile and scheduler :-)
But I agree.

 What I suggest is roughly summarized as follows:
 - Add a new SRQ_WILLSLEEP flag, or whatever name you prefer.
 - Add a new flags argument to sched_lend_prio() (both ULE and 4BSD)
 and to sched_thread_priority() (ULE only).
 - sched_thread_priority() will pass the new flag down to sched_add(),
 which passes it down to sched_pickcpu().

 This way sched_pickcpu() has the correct knowledge of what is going on
 and can make the right decision. You likely don't need to lower
 tdq_load at that time either, because sched_pickcpu() can
 just adjust it locally for its decision.
 
 What do you think?

This sounds easy, but it is not quite so given the implementation of
sched_pickcpu and sched_lowest.  This is probably more work than I am
able to take on now.

-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: ule+smp: small optimization for turnstile priority lending

2012-09-18 Thread Jeff Roberson

On Tue, 18 Sep 2012, Andriy Gapon wrote:


on 18/09/2012 19:50 Attilio Rao said the following:

On 9/18/12, Andriy Gapon a...@freebsd.org wrote:


Here is a snippet that demonstrates the issue on a supposedly fully loaded
2-processor system:

136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
state:running, attributes: prio:122

136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
111916,
state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
14,
state:running, attributes: prio:255

136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
counter:0,
attributes: none

136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
113473,
state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
counter:2,
attributes: none

136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
113473,
point:wokeup, attributes: linkedto:Xorg tid 102818

136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
state:runq add, attributes: prio:122, linkedto:firefox tid 113473

136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
counter:1,
attributes: none

136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
state:runq rem, attributes: prio:176

136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
113473

136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
113473,
state:running, attributes: prio:104

See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted
the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with
zero load.


I think that the idea is bright, but I have reservations against the
implementation because it seems to me there are too many layering
violations.


Just one - for a layer between turnstile and scheduler :-)
But I agree.


What I suggest is roughly summarized as follows:
- Add a new SRQ_WILLSLEEP flag, or whatever name you prefer.
- Add a new flags argument to sched_lend_prio() (both ULE and 4BSD)
and to sched_thread_priority() (ULE only).
- sched_thread_priority() will pass the new flag down to sched_add(),
which passes it down to sched_pickcpu().

This way sched_pickcpu() has the correct knowledge of what is going on
and can make the right decision. You likely don't need to lower
tdq_load at that time either, because sched_pickcpu() can
just adjust it locally for its decision.

What do you think?


This sounds easy, but it is not quite so given the implementation of
sched_pickcpu and sched_lowest.  This is probably more work than I am
able to take on now.


I agree with Attilio's assessment.  I have tried to do similar things
before for ithreads etc.  I think a more generic approach would be good. 
I will put it on my list of things to look at and we'll see who gets time 
first.


Thanks,
Jeff



--
Andriy Gapon


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: ule+smp: small optimization for turnstile priority lending

2012-09-18 Thread Attilio Rao
On Tue, Sep 18, 2012 at 8:07 PM, Andriy Gapon a...@freebsd.org wrote:
 on 18/09/2012 19:50 Attilio Rao said the following:
 On 9/18/12, Andriy Gapon a...@freebsd.org wrote:

 Here is a snippet that demonstrates the issue on a supposedly fully loaded
 2-processor system:

 136794   0 3670427870244462 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:running, attributes: prio:122

 136793   0 3670427870241000 KTRGRAPH group:thread, id:cc1plus tid
 111916,
 state:yielding, attributes: prio:183, wmesg:(null), lockname:(null)

 136792   1 3670427870240829 KTRGRAPH group:thread, id:idle: cpu1 tid
 14,
 state:running, attributes: prio:255

 136791   1 3670427870239520 KTRGRAPH group:load, id:CPU 1 load,
 counter:0,
 attributes: none

 136790   1 3670427870239248 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:blocked, attributes: prio:122, wmesg:(null), lockname:unp_mtx

 136789   1 3670427870237697 KTRGRAPH group:load, id:CPU 0 load,
 counter:2,
 attributes: none

 136788   1 3670427870236394 KTRGRAPH group:thread, id:firefox tid
 113473,
 point:wokeup, attributes: linkedto:Xorg tid 102818

 136787   1 3670427870236145 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq add, attributes: prio:122, linkedto:firefox tid 113473

 136786   1 3670427870235981 KTRGRAPH group:load, id:CPU 1 load,
 counter:1,
 attributes: none

 136785   1 3670427870235707 KTRGRAPH group:thread, id:Xorg tid 102818,
 state:runq rem, attributes: prio:176

 136784   1 3670427870235423 KTRGRAPH group:thread, id:Xorg tid 102818,
 point:prio, attributes: prio:176, new prio:122, linkedto:firefox tid
 113473

 136783   1 3670427870202392 KTRGRAPH group:thread, id:firefox tid
 113473,
 state:running, attributes: prio:104

 See how the Xorg thread was forced from CPU 1 to CPU 0, where it preempted
 the cc1plus thread (I do have preemption enabled), only to leave CPU 1 with
 zero load.

 I think that the idea is bright, but I have reservations against the
 implementation because it seems to me there are too many layering
 violations.

 Just one - for a layer between turnstile and scheduler :-)
 But I agree.

 What I suggest is roughly summarized as follows:
 - Add a new SRQ_WILLSLEEP flag, or whatever name you prefer.
 - Add a new flags argument to sched_lend_prio() (both ULE and 4BSD)
 and to sched_thread_priority() (ULE only).
 - sched_thread_priority() will pass the new flag down to sched_add(),
 which passes it down to sched_pickcpu().

 This way sched_pickcpu() has the correct knowledge of what is going on
 and can make the right decision. You likely don't need to lower
 tdq_load at that time either, because sched_pickcpu() can
 just adjust it locally for its decision.

 What do you think?

 This sounds easy, but it is not quite so given the implementation of
 sched_pickcpu and sched_lowest.  This is probably more work than I am
 able to take on now.

I think actually that the attached patch should do what you need. Of
course there is more runqueue locking, due to the tdq_load fondling,
but I'm not sure it is really a big deal.
I didn't test it yet, as I understand you have a test case, so maybe
you can. However if Jeff agrees I can send the patch to flo@ for
performance evaluation.

Thoughts?

Attilio

Index: sys/sys/sched.h
===
--- sys/sys/sched.h (revision 240672)
+++ sys/sys/sched.h (working copy)
@@ -91,7 +91,7 @@ void  sched_nice(struct proc *p, int nice);
  */
 void   sched_exit_thread(struct thread *td, struct thread *child);
 void   sched_fork_thread(struct thread *td, struct thread *child);
-void   sched_lend_prio(struct thread *td, u_char prio);
+void   sched_lend_prio(struct thread *td, u_char prio, int flags);
 void   sched_lend_user_prio(struct thread *td, u_char pri);
 fixpt_t sched_pctcpu(struct thread *td);
 void   sched_prio(struct thread *td, u_char prio);
@@ -161,6 +161,7 @@ sched_unpin(void)
 #define SRQ_INTR       0x0004  /* It is probably urgent. */
 #define SRQ_PREEMPTED  0x0008  /* has been preempted.. be kind */
 #define SRQ_BORROWING  0x0010  /* Priority updated due to prio_lend */
+#define SRQ_WILLSWITCH 0x0020  /* curthread will be switched out */

 /* Scheduler stats. */
 #ifdef SCHED_STATS
Index: sys/kern/sched_ule.c
===
--- sys/kern/sched_ule.c (revision 240672)
+++ sys/kern/sched_ule.c (working copy)
@@ -290,7 +290,7 @@ static struct tdq   tdq_cpu;
 #define TDQ_LOCKPTR(t)  ((t)->tdq_lock)

 static void sched_priority(struct thread *);
-static void sched_thread_priority(struct thread *, u_char);
+static void sched_thread_priority(struct thread *, u_char, int flags);
 static int sched_interact_score(struct thread *);
 static void sched_interact_update(struct thread *);
 static void sched_interact_fork(struct thread *);
@@ -1233,6 +1233,18 @@ sched_pickcpu(struct thread *td,