On 16.05.2013 01:08, Daniel Farina wrote:
On Mon, May 13, 2013 at 5:50 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
pgbench -S is such a workload. With 9.3beta1, I'm seeing this profile when
I run "pgbench -S -c64 -j64 -T60 -M prepared" on a 32-core Linux machine:

-  64.09%  postgres  postgres           [.] tas
    - tas
       - 99.83% s_lock
          - 53.22% LWLockAcquire
             + 99.87% GetSnapshotData
          - 46.78% LWLockRelease
               GetSnapshotData
             + GetTransactionSnapshot
+   2.97%  postgres  postgres           [.] tas
+   1.53%  postgres  libc-2.13.so       [.] 0x119873
+   1.44%  postgres  postgres           [.] GetSnapshotData
+   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
+   1.18%  postgres  postgres           [.] AllocSetAlloc
...

So, on this test, a lot of time is wasted spinning on the mutex of
ProcArrayLock. If you plot a graph of TPS vs. # of clients, there is a
surprisingly steep drop in performance once you go beyond 29 clients
(attached, pgbench-lwlock-cas-local-clients-sets.png, red line). My theory
is that beyond that point all the cores are busy, and processes sometimes
get context-switched while holding the spinlock, which kills performance.

I have; I also used Linux perf to come to this conclusion, and my
determination was similar: a system was undergoing increasingly heavy
load, in this case with processes >> number of processors. It was
also a phase-change type of event: at one moment everything would be
going great, but once a critical threshold was hit, s_lock would
consume an enormous amount of CPU time. I figured preemption while in
the spinlock was to blame at the time, given the extreme nature

Stop the press! I'm getting the same speedup on that 32-core box that I got with the compare-and-swap patch, from this one-liner:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -200,6 +200,8 @@ typedef unsigned char slock_t;

 #define TAS(lock) tas(lock)

+#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))
+
 static __inline__ int
 tas(volatile slock_t *lock)
 {

So, on this system, doing a non-locked test before the locked xchg instruction while spinning is a very good idea. That contradicts the testing that was done earlier, when the x86-64 implementation was added, as we have this comment in the tas() implementation:

        /*
         * On Opteron, using a non-locking test before the locking instruction
         * is a huge loss.  On EM64T, it appears to be a wash or small loss,
         * so we needn't bother to try to distinguish the sub-architectures.
         */
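
(For context: ports that don't define TAS_SPIN themselves get the generic fallback in s_lock.h, so the retry loop in s_lock() re-executes the locked instruction on every iteration. The one-liner above simply overrides that fallback on x86-64:)

#if !defined(TAS_SPIN)
#define TAS_SPIN(lock)	TAS(lock)
#endif	 /* TAS_SPIN */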

On my test system, the non-locking test is a big win. I tested this because I was reading this article from Intel:

http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/

It says very explicitly that the non-locking test is a good idea:

Spinning on volatile read vs. spin on lock attempt

One common mistake made by developers developing their own spin-wait loops
is attempting to spin on an atomic instruction instead of spinning on a
volatile read. Spinning on a dirty read instead of attempting to acquire a
lock consumes less time and resources. This allows an application to
attempt to acquire a lock only when it is free.
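
To make the distinction concrete, here's a minimal self-contained sketch using C11 atomics rather than the s_lock.h machinery (the names spinlock_t, spin_tas, spin_ttas and spin_unlock are mine, for illustration only):

#include <stdatomic.h>

typedef struct
{
	atomic_int	locked;			/* 0 = free, 1 = held */
} spinlock_t;

/*
 * Naive spin: every iteration issues a locked xchg, so the waiters keep
 * stealing the cache line from each other and from the lock holder.
 */
static void
spin_tas(spinlock_t *s)
{
	while (atomic_exchange(&s->locked, 1) != 0)
		;						/* spin */
}

/*
 * Test-and-test-and-set: spin on a plain load, and attempt the locked
 * xchg only once the lock has been observed free -- the same idea as
 * the TAS_SPIN one-liner above.
 */
static void
spin_ttas(spinlock_t *s)
{
	for (;;)
	{
		while (atomic_load_explicit(&s->locked, memory_order_relaxed) != 0)
			;					/* dirty read, stays in the local cache */
		if (atomic_exchange(&s->locked, 1) == 0)
			return;				/* got it */
	}
}

static void
spin_unlock(spinlock_t *s)
{
	atomic_store_explicit(&s->locked, 0, memory_order_release);
}

The dirty read in spin_ttas is satisfied from the waiter's own cache; the waiters generate coherence traffic only when the holder's release invalidates the line, which is presumably why the non-locking test can win under heavy contention.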

Now, I'm not sure what to do about this. If we put the non-locking test in there, then according to the earlier testing it would be a huge loss on Opterons.

Perhaps we should just sleep earlier, i.e. lower MAX_SPINS_PER_DELAY. That way, even if each TAS_SPIN test is very expensive, we don't spend too much time spinning if the lock is really busy, or held by a process that is sleeping.
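
Continuing the sketch above, a bounded spin phase would look roughly like this. max_spins plays the role of MAX_SPINS_PER_DELAY, nanosleep stands in for pg_usleep(), and the doubling backoff is a simplification of the randomized delay growth in s_lock.c:

#include <time.h>

/*
 * TTAS with a bounded spin phase: after max_spins unsuccessful
 * iterations, stop burning CPU and sleep instead, with the sleep
 * growing on each round.  Lowering max_spins makes a busy lock (or
 * one held by a preempted process) fall back to sleeping sooner.
 */
static void
spin_ttas_backoff(spinlock_t *s, int max_spins)
{
	int			spins = 0;
	long		delay_ns = 1000L * 1000;	/* start at 1 ms */

	for (;;)
	{
		while (atomic_load_explicit(&s->locked, memory_order_relaxed) != 0)
		{
			if (++spins >= max_spins)
			{
				struct timespec ts = {0, delay_ns};

				nanosleep(&ts, NULL);		/* stand-in for pg_usleep() */
				delay_ns *= 2;				/* simplified, non-randomized */
				if (delay_ns > 999999999L)	/* keep tv_nsec legal (< 1 s) */
					delay_ns = 999999999L;
				spins = 0;
			}
		}
		if (atomic_exchange(&s->locked, 1) == 0)
			return;
	}
}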

- Heikki

