On 04/09/2018 06:58 AM, Will Deacon wrote: > Hi Waiman, > > Thanks for taking this lot for a spin. Comments and questions below. > > On Fri, Apr 06, 2018 at 04:50:19PM -0400, Waiman Long wrote: >> On 04/05/2018 12:58 PM, Will Deacon wrote: >>> The qspinlock locking slowpath utilises a "pending" bit as a simple form >>> of an embedded test-and-set lock that can avoid the overhead of explicit >>> queuing in cases where the lock is held but uncontended. This bit is >>> managed using a cmpxchg loop which tries to transition the uncontended >>> lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1). >>> >>> Unfortunately, the cmpxchg loop is unbounded and lockers can be starved >>> indefinitely if the lock word is seen to oscillate between unlocked >>> (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are >>> able to take the lock in the cmpxchg loop without queuing and pass it >>> around amongst themselves. >>> >>> This patch fixes the problem by unconditionally setting _Q_PENDING_VAL >>> using atomic_fetch_or, and then inspecting the old value to see whether >>> we need to spin on the current lock owner, or whether we now effectively >>> hold the lock. The tricky scenario is when concurrent lockers end up >>> queuing on the lock and the lock becomes available, causing us to see >>> a lockword of (n,0,0). With pending now set, simply queuing could lead >>> to deadlock as the head of the queue may not have observed the pending >>> flag being cleared. Conversely, if the head of the queue did observe >>> pending being cleared, then it could transition the lock from (n,0,0) -> >>> (0,0,1) meaning that any attempt to "undo" our setting of the pending >>> bit could race with a concurrent locker trying to set it. >>> >>> We handle this race by preserving the pending bit when taking the lock >>> after reaching the head of the queue and leaving the tail entry intact >>> if we saw pending set, because we know that the tail is going to be >>> updated shortly. >>> >>> Cc: Peter Zijlstra <pet...@infradead.org> >>> Cc: Ingo Molnar <mi...@kernel.org> >>> Signed-off-by: Will Deacon <will.dea...@arm.com> >>> --- >> The pending bit was added to the qspinlock design to counter performance >> degradation compared with ticket lock for workloads with light >> spinlock contention. I run my spinlock stress test on a Intel Skylake >> server running the vanilla 4.16 kernel vs a patched kernel with this >> patchset. The locking rates with different number of locking threads >> were as follows: >> >> # of threads 4.16 kernel patched 4.16 kernel >> ------------ ----------- ------------------- >> 1 7,417 kop/s 7,408 kop/s >> 2 5,755 kop/s 4,486 kop/s >> 3 4,214 kop/s 4,169 kop/s >> 4 4,396 kop/s 4,383 kop/s >> >> The 2 contending threads case is the one that exercise the pending bit >> code path the most. So it is obvious that this is the one that is most >> impacted by this patchset. The differences in the other cases are mostly >> noise or maybe just a little bit on the 3 contending threads case. > That is bizarre. A few questions: > > 1. Is this with my patches as posted, or also with your WRITE_ONCE change?
This is just the with your patches as posted. > 2. Could you try to bisect my series to see which patch is responsible > for this degradation, please? I have done further analysis with the help of CONFIG_QUEUED_LOCK_STAT with another patch to enable counting the pending and the queuing code paths. Running the 2-thread test with the original qspinlock code on a Haswell server, the performance data were pending count = 3,265,220 queuing count = 22 locking rate = 11,648 kop/s With your posted patches, pending count = 330 queuing count = 9,965,127 locking rate = 4,178 kop/s I believe that my test case has heavy dependency on _Q_PENDING_VAL spinning loop. When I added back the loop, the performance data became: pending count = 3,278,320 queuing count = 0 locking rate = 11,884 kop/s Instead of an infinite loop, I also tried a limited spin with loop count of 0x200 and I got similar performance data as the infinite loop case. > 3. Could you point me at your stress test, so I can try to reproduce these > numbers on arm64 systems, please? I will send you the test that I used in a separate email. >> I am not against this patch, but we certainly need to find out a way to >> bring the performance number up closer to what it is before applying >> the patch. > We certainly need to *understand* where the drop is coming from, because > the two-threaded case is still just a CAS on x86 with and without this > patch series. Generally, there's a throughput cost when ensuring fairness > and forward-progress otherwise we'd all be using test-and-set. As stated above, the drop comes mainly from skipping the _Q_PENDING_VAL spinning loop. I supposed that if we just do a limited spin, we can still ensure forward progress while preserving the performance profile of the original qspinlock code. I don't think other codes in your patches cause any performance regression as far as my testing is concerned. Cheers, Longman