qspinlock: Enhance pvqspinlock

Waiman Long Fri, 30 Oct 2015 16:29:03 -0700

v8->v9:
 - Added a new patch 2 which tried to prefetch the cacheline of the
   next MCS node in order to reduce the MCS unlock latency when it
   was time to do the unlock.
 - Changed the slowpath statistics counters implementation in patch
   4 from atomic_t to per-cpu variables to reduce performance overhead
   and used sysfs instead of debugfs to return the consolidated counts
   and data.


v7->v8:
 - Annotated the use of each _acquire/_release variants in qspinlock.c.
 - Used the available pending bit in the lock stealing patch to disable
   lock stealing when the queue head vCPU is actively spinning on the
   lock to avoid lock starvation.
 - Restructured the lock stealing patch to reduce code duplication.
 - Verified that the waitcnt processing will be compiled away if
   QUEUED_LOCK_STAT isn't enabled.

v6->v7:
 - Removed arch/x86/include/asm/qspinlock.h from patch 1.
 - Removed the unconditional PV kick patch as it has been merged
   into tip.
 - Changed the pvstat_inc() API to add a new condition parameter.
 - Added comments and rearrange code in patch 4 to clarify where
   lock stealing happened.
 - In patch 5, removed the check for pv_wait count when deciding when
   to wait early.
 - Updated copyrights and email address.

v5->v6:
 - Added a new patch 1 to relax the cmpxchg and xchg operations in
   the native code path to reduce performance overhead on non-x86
   architectures.
 - Updated the unconditional PV kick patch as suggested by PeterZ.
 - Added a new patch to allow one lock stealing attempt at slowpath
   entry point to reduce performance penalty due to lock waiter
   preemption.
 - Removed the pending bit and kick-ahead patches as they didn't show
   any noticeable performance improvement on top of the lock stealing
   patch.
 - Simplified the adaptive spinning patch as the lock stealing patch
   allows more aggressive pv_wait() without much performance penalty
   in non-overcommitted VMs.

v4->v5:
 - Rebased the patch to the latest tip tree.
 - Corrected the comments and commit log for patch 1.
 - Removed the v4 patch 5 as PV kick deferment is no longer needed with
   the new tip tree.
 - Simplified the adaptive spinning patch (patch 6) & improve its
   performance a bit further.
 - Re-ran the benchmark test with the new patch.

v3->v4:
 - Patch 1: add comment about possible racing condition in PV unlock.
 - Patch 2: simplified the pv_pending_lock() function as suggested by
   Davidlohr.
 - Move PV unlock optimization patch forward to patch 4 & rerun
   performance test.

v2->v3:
 - Moved deferred kicking enablement patch forward & move back
   the kick-ahead patch to make the effect of kick-ahead more visible.
 - Reworked patch 6 to make it more readable.
 - Reverted back to use state as a tri-state variable instead of
   adding an additional bistate variable.
 - Added performance data for different values of PV_KICK_AHEAD_MAX.
 - Add a new patch to optimize PV unlock code path performance.

v1->v2:
 - Take out the queued unfair lock patches
 - Add a patch to simplify the PV unlock code
 - Move pending bit and statistics collection patches to the front
 - Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
   when appropriate.
 - Change the wait-early patch to use adaptive spinning to better
   balance the difference effect on normal and over-committed guests.
 - Add patch-to-patch performance changes in the patch commit logs.

This patchset tries to improve the performance of both regular and
over-commmitted VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.

Patch 1 relaxes the memory order restriction of atomic operations by
using less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This will reduce performance overhead when ported to other
non-x86 architectures.

Patch 2 attempts to prefetch the cacheline of the next MCS node to
reduce latency in the MCS unlock operation.

Patch 3 optimizes the PV unlock code path performance for x86-64
architecture.

Patch 4 allows the collection of various slowpath statistics counter
data that are useful to see what is happening in the system. Per-cpu
counters are used to minimize performance overhead.

Patch 5 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.

Patch 6 enables adaptive spinning in the queue nodes. This patch
leads to further performance improvement in over-committed guest,
though it is not as big as the previous patch.

Waiman Long (6):
  locking/qspinlock: Use _acquire/_release versions of cmpxchg & xchg
  locking/qspinlock: prefetch next node cacheline
  locking/pvqspinlock, x86: Optimize PV unlock code path
  locking/pvqspinlock: Collect slowpath lock statistics
  locking/pvqspinlock: Allow 1 lock stealing attempt
  locking/pvqspinlock: Queue node adaptive spinning

 arch/x86/Kconfig                          |    8 +
 arch/x86/include/asm/qspinlock_paravirt.h |   59 ++++++
 include/asm-generic/qspinlock.h           |    9 +-
 kernel/locking/qspinlock.c                |   99 +++++++---
 kernel/locking/qspinlock_paravirt.h       |  221 +++++++++++++++++----
 kernel/locking/qspinlock_stat.h           |  310 +++++++++++++++++++++++++++++
 6 files changed, 638 insertions(+), 68 deletions(-)
 create mode 100644 kernel/locking/qspinlock_stat.h

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH tip/locking/core v9 0/6] locking/qspinlock: Enhance pvqspinlock

Reply via email to