[HACKERS] LWLock optimization for multicore Power machines

Alexander Korotkov Fri, 03 Feb 2017 09:02:30 -0800

Hi everybody!

During the FOSDEM/PGDay 2017 developer meeting I said that I have a special
assembly optimization for multicore Power machines.  From the answers of
other hackers I realized the following.


   1. There are some big Power machines running PostgreSQL in production.
   Not as many as Intel machines, but some.
   2. The community could be interested in a special assembly optimization
   for Power machines despite the cost of maintaining it.

Power processors implement atomic operations in a specific way, as a kind
of optimistic locking.  The 'lwarx' instruction (load word and reserve
indexed) places a reservation on a memory word, but that reservation can be
lost before the matching 'stwcx.' (store word conditional indexed), and
then we have to retry.  So, for instance, a CAS operation on a Power
processor is itself a loop, and a loop of CAS operations is a two-level
nested loop.  Benchmarks showed that this becomes a real problem for
LWLockAttemptLock().  However, one can actually put arbitrary logic between
'lwarx' and 'stwcx.' and turn it into a single loop.  The downside is that
this logic has to be implemented in assembly.  See [1] for experiment
details.
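
To make the nested loop concrete, here is a minimal sketch of a
LWLockAttemptLock()-style retry loop written with C11 atomics.  It is not
the code from lwlock.c or from the patches: the names demo_attempt_lock and
DEMO_* and the bit layout of the state word are assumptions made for
illustration only.  The outer loop is the visible for(;;); on Power the
compare-and-swap inside it is compiled to its own lwarx/stwcx. retry loop,
which is the inner loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state layout, chosen for illustration only. */
#define DEMO_VAL_EXCLUSIVE ((uint32_t) 1 << 24)
#define DEMO_VAL_SHARED    ((uint32_t) 1)
#define DEMO_LOCK_MASK     ((uint32_t) ((1 << 25) - 1))

/*
 * Try to take the lock once.  Returns true if it was acquired, false if it
 * is currently held in a conflicting mode.
 */
static bool
demo_attempt_lock(_Atomic uint32_t *state, bool exclusive)
{
    uint32_t old_state;

    assert(state != NULL);
    old_state = atomic_load(state);

    for (;;)                    /* outer loop: retry until the CAS succeeds */
    {
        uint32_t desired = old_state;
        bool     lock_free;

        if (exclusive)
        {
            /* exclusive needs no holders at all */
            lock_free = (old_state & DEMO_LOCK_MASK) == 0;
            if (lock_free)
                desired += DEMO_VAL_EXCLUSIVE;
        }
        else
        {
            /* shared only conflicts with an exclusive holder */
            lock_free = (old_state & DEMO_VAL_EXCLUSIVE) == 0;
            if (lock_free)
                desired += DEMO_VAL_SHARED;
        }

        /* inner loop on Power: the CAS itself is an lwarx/stwcx. loop */
        if (atomic_compare_exchange_strong(state, &old_state, desired))
            return lock_free;

        /* CAS failed: old_state now holds the current value, so recompute */
    }
}
```

The single-loop variant described above would move the mask check and the
addition between the load-reserve and the store-conditional, which plain C
cannot express.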

The results in [1] contain a lot of junk that isn't relevant anymore, which
is why I drew a separate graph.

power8-lwlock-asm-ro.png: results of a read-only pgbench test on an IBM
E880, which has 32 physical cores and 256 virtual threads via SMT.  The
curves have the following meaning.
 * 9.5: unpatched PostgreSQL 9.5
 * pinunpin-cas: PostgreSQL 9.5 + earlier version of 48354581
 * pinunpin-lwlock-asm: PostgreSQL 9.5 + earlier version of 48354581 +
LWLock implementation in assembly.

lwlock-power-1.patch is the patch with the assembly implementation of
LWLock that I used at that time, rebased to current master.

Using assembly in lwlock.c looks rough.  This is why I refactored it by
introducing a new atomic operation, pg_atomic_fetch_mask_add_u32 (see
lwlock-power-2.patch).  It checks that all masked bits are clear and, if
so, adds a value to the variable.  This atomic has a special assembly
implementation for Power, and a generic implementation for other platforms
that uses a loop of CAS operations.  We could probably add implementations
for other architectures in the future.  This level of abstraction is the
best I managed to invent.
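
For readers without the patch at hand, the generic (non-Power) variant of
such an operation could look roughly like the sketch below, written with
C11 atomics rather than PostgreSQL's pg_atomic_* API.  The name
demo_fetch_mask_add_u32, the argument order, and the choice to return the
previous value are my assumptions here; the actual signature in
lwlock-power-2.patch may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * If no bit of "mask" is set in *ptr, atomically add "increment" to *ptr.
 * In both cases return the value that was observed before any change, so
 * the caller can test (result & mask) == 0 to learn whether the addition
 * took place.
 */
static uint32_t
demo_fetch_mask_add_u32(_Atomic uint32_t *ptr, uint32_t mask,
                        uint32_t increment)
{
    uint32_t old;

    assert(ptr != NULL);
    old = atomic_load(ptr);

    /*
     * Give up as soon as a masked bit is seen; otherwise try to publish
     * old + increment.  A failed CAS reloads "old", and the mask is
     * re-checked before the next attempt.
     */
    while ((old & mask) == 0 &&
           !atomic_compare_exchange_weak(ptr, &old, old + increment))
    {
        /* retry with the refreshed value of "old" */
    }

    return old;
}
```

With a primitive of this shape, an exclusive lock attempt reduces to one
call with the lock mask and the exclusive value, followed by a test of the
returned word against the mask.  A platform-specific version is then free
to do the check and the addition inside a single load-reserve /
store-conditional loop.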

Unfortunately, I have no big enough Power machine at hand to reproduce those
results.  Actually, I have no Power machine at hand at all.  So,
lwlock-power-2.patch was written "blindly".  I would very much appreciate it
if someone could help me with testing and benchmarking.

1. https://www.postgresql.org/message-id/CAPpHfdsogj38HTDhNMLE56uJy9N8-
[email protected]

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

lwlock-power-1.patch
Description: Binary data

lwlock-power-2.patch
Description: Binary data
