On Jun 28, 2011, at 22:18, Robert Haas wrote:
> On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <[email protected]> wrote:
>> [ testing of various spinlock implementations ]
>
> I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
> loaner from Nate Boley:
Cool, thanks!
> 100 counter increments per cycle
> [ table reformatted below ]
Here's the same table, formatted with spaces.
worker          1                 2                 4                 8                 16                32
time [s/cycle]  wall     user     wall     user     wall     user     wall     user     wall     user     wall     user
none            2.8e-07  2.8e-07  1.5e-07  3.0e-07  8.0e-08  3.2e-07  4.2e-08  3.3e-07  2.1e-08  3.3e-07  1.1e-08  3.4e-07
atomicinc       3.6e-07  3.6e-07  2.6e-07  5.1e-07  1.4e-07  5.5e-07  1.4e-07  1.1e-06  1.5e-07  2.3e-06  1.5e-07  4.9e-06
cmpxchng        3.6e-07  3.6e-07  3.4e-07  6.9e-07  3.2e-07  1.3e-06  2.9e-07  2.3e-06  4.2e-07  6.6e-06  4.5e-07  1.4e-05
spin            4.1e-07  4.1e-07  2.8e-07  5.7e-07  1.6e-07  6.3e-07  1.2e-06  9.4e-06  3.8e-06  6.1e-05  1.4e-05  4.3e-04
pg_lwlock       3.8e-07  3.8e-07  2.7e-07  5.3e-07  1.5e-07  6.2e-07  3.9e-07  3.1e-06  1.6e-06  2.5e-05  6.4e-06  2.0e-04
pg_lwlock_cas   3.7e-07  3.7e-07  2.8e-07  5.6e-07  1.4e-07  5.8e-07  1.4e-07  1.1e-06  1.9e-07  3.0e-06  2.4e-07  7.5e-06
And here's the throughput table calculated from your results, i.e. the
wall time per cycle for 1 worker divided by the wall time per cycle for
N workers (so higher is better).
workers            2      4      8     16     32
none             1.9    3.5    6.7     13     26
atomicinc        1.4    2.6    2.6    2.4    2.4
cmpxchng         1.1    1.1    1.2    0.9    0.8
spin             1.5    2.6    0.3    0.1    0.0
pg_lwlock        1.4    2.5    1.0    0.2    0.1
pg_lwlock_cas    1.3    2.6    2.6    1.9    1.5
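For example, the atomicinc row of that table can be reproduced with
something like this (a throwaway snippet, not the script I actually
used; the wall times are hard-coded from your table):

    #include <stdio.h>

    int main(void)
    {
        /* atomicinc wall time per cycle at 1, 2, 4, 8, 16, 32 workers */
        int    workers[] = { 1, 2, 4, 8, 16, 32 };
        double wall[]    = { 3.6e-07, 2.6e-07, 1.4e-07, 1.4e-07, 1.5e-07, 1.5e-07 };
        int    i;

        /* relative throughput = wall time at 1 worker / wall time at N workers */
        for (i = 1; i < 6; i++)
            printf("%2d workers: %.1f\n", workers[i], wall[0] / wall[i]);
        return 0;
    }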
Hm, so in the best case we get 2.6x the throughput of a single core,
and that only for 4 and 8 workers (1.4e-07 seconds/cycle vs. 3.6e-07).
In that case, there also seems to be little difference between
pg_lwlock{_cas} and atomicinc. atomicinc again manages to at least
sustain that throughput when the worker count is increased, while
for the others the throughput actually *decreases*.
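In case it helps to see why cmpxchng degrades so much more than
atomicinc under contention: the essential difference is a single
fetch-and-add versus a compare-and-swap loop that must retry whenever
another worker got in between. Roughly like this, sketched with GCC's
__sync builtins (the actual lockbench code may differ):

    /* atomicinc-style update: one locked fetch-and-add, which always
     * completes in a single instruction */
    static void
    counter_add_atomicinc(volatile long *counter)
    {
        __sync_fetch_and_add(counter, 1);
    }

    /* cmpxchng-style update: read the old value and retry the CAS until
     * it sticks - under heavy contention every failed attempt costs
     * another cache-line round trip */
    static void
    counter_add_cmpxchng(volatile long *counter)
    {
        long    old;

        do
        {
            old = *counter;
        } while (!__sync_bool_compare_and_swap(counter, old, old + 1));
    }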
What totally puzzles me is that your results don't show any
trace of a decreased system load for the pg_lwlock implementation,
which I'd have expected due to the sleep() in the contested
path. Here are the user vs. wall time ratios - I'd have expected
to see values significantly below the number of workers for pg_lwlock:
workers            1      2      4      8     16     32
none             1.0    2.0    4.0    7.9     16     31
atomicinc        1.0    2.0    3.9    7.9     15     33
cmpxchng         1.0    2.0    4.1    7.9     16     31
spin             1.0    2.0    3.9    7.8     16     31
pg_lwlock        1.0    2.0    4.1    7.9     16     31
pg_lwlock_cas    1.0    2.0    4.1    7.9     16     31
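To be explicit about what I mean by "the sleep() in the contested path":
a spin-then-sleep acquire loop along the lines of the simplified sketch
below (not the actual pg_lwlock code). Once a worker is asleep it stops
accumulating user time, which is why I expected the user/wall ratio to
fall well below the worker count:

    #include <unistd.h>

    #define SPINS_BEFORE_SLEEP 1000

    /* Spin-then-sleep acquire: after a bounded number of failed
     * test-and-set attempts, go to sleep instead of burning the CPU.
     * Sleeping adds wall time but no user time. */
    static void
    lock_acquire(volatile int *lock)
    {
        int     spins = 0;

        while (__sync_lock_test_and_set(lock, 1))
        {
            if (++spins >= SPINS_BEFORE_SLEEP)
            {
                usleep(1000);
                spins = 0;
            }
        }
    }

    static void
    lock_release(volatile int *lock)
    {
        __sync_lock_release(lock);
    }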
> I wrote a little script to reorganize this data in a
> possibly-easier-to-understand format - ordering each column from
> lowest to highest, and showing each algorithm as a multiple of the
> cheapest value for that column:
If you're OK with it, I'd like to add that script to the lockbench
repo.
> There seems to be something a bit funky in your 3-core data, but
> overall I read this data to indicate that 4 cores aren't really enough
> to see a severe problem with spinlock contention.
Hm, it starts to show if you lower the number of counter increments
per cycle (the D constant in run.sh). But yeah, it's never as bad as
the 32-core results above.
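For reference, a cycle in the lock-based variants does roughly the
following (simplified, reusing the lock_acquire()/lock_release() sketch
from above; the real loop is in the lockbench repo). The smaller D is,
the larger the share of each cycle spent acquiring and releasing the
lock, so contention shows up at lower worker counts:

    #define D 100                       /* counter increments per cycle */

    /* One benchmark cycle: grab the lock, do D plain increments of a
     * counter, release.  Lowering D shrinks the work done per lock
     * acquisition and thus raises the pressure on the lock itself. */
    static void
    run_cycle(volatile int *lock, volatile long *counter)
    {
        int     i;

        lock_acquire(lock);
        for (i = 0; i < D; i++)
            (*counter)++;
        lock_release(lock);
    }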
best regards,
Florian Pflug