Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

Waiman Long Thu, 22 Sep 2016 14:49:30 -0700

On 09/22/2016 04:38 PM, Thomas Gleixner wrote:

On Thu, 22 Sep 2016, Waiman Long wrote:

BTW, my initial attempt for the new futex was to use the same workflow as the
PI futexes, but use mutex which has optimistic spinning instead of rt_mutex.
That version can double the throughput compared with PI futexes but still far
short of what can be achieved with wait-wake futex. Looking at the performance
figures from the patch:


                 wait-wake futex     PI futex        TO futex
                 ---------------     --------        --------
max time            3.49s            50.91s          2.65s
min time            3.24s            50.84s          0.07s
average time        3.41s            50.90s          1.84s
sys time          7m22.4s            55.73s        2m32.9s

That's really interesting. Do you have any explanation for this massive
system time differences?

For the wait-wake futexes, the waiters will be kicked out with EAGAINwithout sleeping as long as the futex value changes before they acquirethe hash bucket spinlock. These waiters will attempt to acquire the lockagain in the user space. If that fails, they will call FUTEX_WAIT again.

In the above test case, the critical section is pretty short and about4/5 of the FUTEX_WAIT calls resulted in an EAGAIN error. So there are alot more user-space/kernel transitions than the TO futexes. I think thedifference in the number of FUTEX_WAIT calls vs. the FUTEX_LOCK_TO callscauses the difference between their sys times. As for the PI futex, itis a strictly sleeping lock and so won't use that much sys time.

lock count       3,090,294          9,999,813       698,318
unlock count     3,268,896          9,999,814           134

The problem with a PI futexes like version is that almost all the lock/unlock
operations were done in the kernel which added overhead and latency. Now
looking at the numbers for the TO futexes, less than 1/10 of the lock
operations were done in the kernel, the number of unlock was insignificant.
Locking was done mostly by lock stealing. This is where most of the
performance benefit comes from, not optimistic spinning.

How does the lock latency distribution of all this look like and how fair
is the whole thing?

The TO futexes are unfair as can be seen from the min/max thread timeslisted above. It took the fastest thread 0.07s to complete all thelocking operations, whereas the slowest one needed 2.65s. However, thesituation reverses when I changed the critical section to a 1us sleep.In this case, there will be no optimistic spinning. The performanceresults for 100k locking operations were listed below.


                wait-wake futex     PI futex        TO futex
                ---------------     --------        --------
max time            0.06s             9.32s          4.76s
min time            5.59s             9.36s          5.62s
average time        3.25s             9.35s          5.41s

In this case, the TO futexes are fairer but perform worse than thewait-wake futexes. That is because the lock handoff mechanism limit theamount of lock stealing in the TO futexes while the wait-wake futexeshave no such restriction. When I disabled lock handoff, the TO futexeswould then perform similar to the wait-wake futexes.

This is also the reason that a lock handoff mechanism is implemented to
prevent lock starvation which is likely to happen without one.

Where is that lock handoff mechanism?

In the futex_state object, there is a new field handoff_pid. It is setwhen the threshold count in futex_spin_on_owner() becomes negative Whenthis field is set, the unlocker will change the futex word to that valueinstead of clearing it to 0 and others can steal it. I will separate outthe lock handoff in a separate patch in the next revision to highlight it.


Cheers,
Longman

Re: [RFC PATCH v2 3/5] futex: Throughput-optimized (TO) futexes

Reply via email to