On 10/20/2016 09:36 AM, Dilip Kumar wrote:
On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
I agree with these conclusions. I had a chance to talk with Andres
this morning at Postgres Vision and based on that conversation I'd
like to suggest a couple of additional tests:
1. Repeat this test on x86. In particular, I think you should test on
the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
I have run my test on cthulhu. The basic difference is that on POWER we
saw ClogControlLock at the top at 96 and more clients with scale factor
300, whereas on cthulhu at scale factor 300 the transactionid lock is
always on top. So I repeated my test on cthulhu with scale factor 1000
as well.
All other configuration is the same as in my last test.
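The wait event listings below are of the form "count  wait_event_type |
wait_event", obtained by repeatedly sampling pg_stat_activity during the
run and counting the samples. A minimal Python sketch of that kind of
sampling - not the actual script used; psycopg2, the database name, the
number of samples and the sampling interval are all assumptions:

# Hypothetical sketch: sample pg_stat_activity wait events and count them,
# producing output similar to the listings below. The connection string,
# sample count and sampling interval are assumptions.
import time
from collections import Counter

import psycopg2

counts = Counter()
conn = psycopg2.connect("dbname=pgbench")
conn.autocommit = True

with conn.cursor() as cur:
    for _ in range(3000):              # ~5 minutes at 0.1s intervals
        cur.execute("""
            SELECT coalesce(wait_event_type, ''), coalesce(wait_event, '')
            FROM pg_stat_activity
            WHERE pid <> pg_backend_pid()
        """)
        counts.update(cur.fetchall())
        time.sleep(0.1)

for (etype, event), n in counts.most_common():
    print("%8d %s | %s" % (n, etype, event))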
Test with 1000 scale factor
-------------------------------------
Test1: number of clients: 192
Head:
tps = 21206.108856 (including connections establishing)
tps = 21206.245441 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
310489 LWLockNamed | CLogControlLock
296152 |
35537 Lock | transactionid
15821 LWLockTranche | buffer_mapping
10342 LWLockTranche | buffer_content
8427 LWLockTranche | clog
3961
3165 Lock | extend
2861 Lock | tuple
2781 LWLockNamed | ProcArrayLock
1104 LWLockNamed | XidGenLock
745 LWLockTranche | lock_manager
371 LWLockNamed | CheckpointerCommLock
70 LWLockTranche | wal_insert
5 BufferPin | BufferPin
3 LWLockTranche | proc
Patch:
tps = 28725.038933 (including connections establishing)
tps = 28725.367102 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
540061 |
57810 LWLockNamed | CLogControlLock
36264 LWLockTranche | buffer_mapping
29976 Lock | transactionid
4770 Lock | extend
4735 LWLockTranche | clog
4479 LWLockNamed | ProcArrayLock
4006
3955 LWLockTranche | buffer_content
2505 LWLockTranche | lock_manager
2179 Lock | tuple
1977 LWLockNamed | XidGenLock
905 LWLockNamed | CheckpointerCommLock
222 LWLockTranche | wal_insert
8 LWLockTranche | proc
Test2: number of clients: 96
Head:
tps = 25447.861572 (including connections establishing)
tps = 25448.012739 (excluding connections establishing)
261611 |
69604 LWLockNamed | CLogControlLock
6119 Lock | transactionid
4008
2874 LWLockTranche | buffer_mapping
2578 LWLockTranche | buffer_content
2355 LWLockNamed | ProcArrayLock
1245 Lock | extend
1168 LWLockTranche | clog
232 Lock | tuple
217 LWLockNamed | CheckpointerCommLock
160 LWLockNamed | XidGenLock
158 LWLockTranche | lock_manager
78 LWLockTranche | wal_insert
5 BufferPin | BufferPin
Patch:
tps = 32708.368938 (including connections establishing)
tps = 32708.765989 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_96_ul.txt
326601 |
7471 LWLockNamed | CLogControlLock
5387 Lock | transactionid
4018
3331 LWLockTranche | buffer_mapping
3144 LWLockNamed | ProcArrayLock
1372 Lock | extend
722 LWLockTranche | buffer_content
393 LWLockNamed | XidGenLock
237 LWLockTranche | lock_manager
234 Lock | tuple
194 LWLockTranche | clog
96 Lock | relation
88 LWLockTranche | wal_insert
34 LWLockNamed | CheckpointerCommLock
Test3: number of clients: 64
Head:
tps = 28264.194438 (including connections establishing)
tps = 28264.336270 (excluding connections establishing)
218264 |
10314 LWLockNamed | CLogControlLock
4019
2067 Lock | transactionid
1950 LWLockTranche | buffer_mapping
1879 LWLockNamed | ProcArrayLock
592 Lock | extend
565 LWLockTranche | buffer_content
222 LWLockNamed | XidGenLock
143 LWLockTranche | clog
131 LWLockNamed | CheckpointerCommLock
63 LWLockTranche | lock_manager
52 Lock | tuple
35 LWLockTranche | wal_insert
Patch:
tps = 27906.376194 (including connections establishing)
tps = 27906.531392 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_64_ul.txt
228108 |
4039
2294 Lock | transactionid
2116 LWLockTranche | buffer_mapping
1757 LWLockNamed | ProcArrayLock
1553 LWLockNamed | CLogControlLock
800 Lock | extend
403 LWLockTranche | buffer_content
92 LWLockNamed | XidGenLock
74 LWLockTranche | lock_manager
42 Lock | tuple
35 LWLockTranche | wal_insert
34 LWLockTranche | clog
14 LWLockNamed | CheckpointerCommLock
Test4: number of clients: 32
Head:
tps = 27587.999912 (including connections establishing)
tps = 27588.119611 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
117762 |
4031
614 LWLockNamed | ProcArrayLock
379 LWLockNamed | CLogControlLock
344 Lock | transactionid
183 Lock | extend
102 LWLockTranche | buffer_mapping
71 LWLockTranche | buffer_content
39 LWLockNamed | XidGenLock
25 LWLockTranche | lock_manager
3 LWLockTranche | wal_insert
3 LWLockTranche | clog
2 LWLockNamed | CheckpointerCommLock
2 Lock | tuple
Patch:
tps = 28291.428848 (including connections establishing)
tps = 28291.586435 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
116596 |
4041
757 LWLockNamed | ProcArrayLock
407 LWLockNamed | CLogControlLock
358 Lock | transactionid
183 Lock | extend
142 LWLockTranche | buffer_mapping
77 LWLockTranche | buffer_content
68 LWLockNamed | XidGenLock
35 LWLockTranche | lock_manager
15 LWLockTranche | wal_insert
7 LWLockTranche | clog
7 Lock | tuple
4 LWLockNamed | CheckpointerCommLock
Summary:
- At 96 clients and above we can see ClogControlLock at the top.
- With the patch, contention on ClogControlLock is reduced significantly.
I think this behaviour is the same as what we saw on POWER.
With scale factor 300:
- Contention on ClogControlLock is significant only at 192 clients
(the transactionid lock is still on top), and it is completely removed
with the group lock patch.
For scale factor 300 I am posting data only for the 192 client count (if
anyone is interested in other data, I can post it).
In the results you've posted on 10/12, you've mentioned a regression
with 32 clients, where you got 52k tps on master but only 48k tps with
the patch (a drop of roughly 8%). I have no idea what scale was used for
those tests, and I see no such regression in the current results (but
you only report results for some of the client counts).
Also, which of the proposed patches have you been testing?
Can you collect and share a more complete set of data, perhaps based on
the scripts I use to do tests on the large machine with 36/72 cores,
available at https://bitbucket.org/tvondra/hp05-results ?
I've taken some time to build simple web-based reports from the
results collected so far (also included in the git repository), and
pushed them here:
http://tvondra.bitbucket.org
For each of the completed runs, there's a report comparing tps for
different client counts with master and the three patches (average tps,
median and stddev), and it's possible to download a more thorough text
report with wait event stats, comparison of individual runs etc.
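Just to illustrate what those reports compute, aggregating per-run tps
into the average/median/stddev per client count might look roughly like
this (a sketch only - the CSV layout, column names and file name are
assumptions, not the actual format used in the repository):

# Sketch of the aggregation behind the reports: average, median and stddev
# of tps per (branch, client count). The input layout is an assumption.
import csv
import statistics
from collections import defaultdict

runs = defaultdict(list)                    # (branch, clients) -> list of tps

with open("results.csv") as f:              # hypothetical input file
    for row in csv.DictReader(f):
        key = (row["branch"], int(row["clients"]))
        runs[key].append(float(row["tps"]))

for (branch, clients), tps in sorted(runs.items()):
    stddev = statistics.stdev(tps) if len(tps) > 1 else 0.0
    print("%-25s %4d  avg=%8.0f  median=%8.0f  stddev=%8.0f"
          % (branch, clients, statistics.mean(tps),
             statistics.median(tps), stddev))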
If you want to cooperate on this, I'm available - i.e. I can help you
get the tooling running, customize it etc.
Regarding the results collected on the "big machine" so far, I do have a
few observations:
pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------
* in general, those results seem fine
* the results generally fall into 3 categories (I'll show results for
"pgbench -N" but regular pgbench behaves similarly):
(a) logged, sync_commit=on - no impact
http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip
(b) logged, sync_commit=off - improvement
http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip
The throughput improves by ~20% with 72 clients and then levels off
(but does not drop, unlike on master). With high client counts the
difference is up to 300%, but people who care about throughput won't
run with such client counts anyway.
Not only does this improve throughput, it also significantly reduces
the variability of the performance (i.e. measure the throughput each
second and compute the STDDEV of that; see the sketch after item (c)
below). You can imagine this as a much "smoother" chart of tps over
time.
(c) unlogged, sync_commit=* - improvement
http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip
This is actually quite similar to (b).
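The per-second variability mentioned in (b) is easy to compute from the
pgbench progress output; here is a rough Python sketch, assuming pgbench
was run with "-P 1" and its progress lines were captured to a file (the
file name and the exact line format are assumptions):

# Sketch: parse "progress: ... tps ..." lines from pgbench -P 1 output and
# compute mean and stddev of the per-second tps. The file name is an
# assumption, and the progress line format can differ between versions.
import re
import statistics

tps = []
with open("pgbench-progress.log") as f:
    for line in f:
        if not line.startswith("progress:"):
            continue
        m = re.search(r"([\d.]+) tps", line)
        if m:
            tps.append(float(m.group(1)))

if len(tps) > 1:
    print("mean %.1f tps, stddev %.1f tps"
          % (statistics.mean(tps), statistics.stdev(tps)))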
dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------
* those results seem less OK
* I haven't found any significant regressions (in the sense of a
significant performance drop compared to master), but the behavior in
some cases seems fairly strange (and it's repeatable)
* consider for example these results:
http://tvondra.bitbucket.org/#dilip-300-unlogged-async
http://tvondra.bitbucket.org/#dilip-300-logged-async
* the saw-like pattern is rather suspicious, and I don't think I've seen
anything like that before - I guess there's some feedback loop, and we'd
better find it before committing any of the patches, because this is
something I don't want to see on any production machine (and I bet
neither do you)
* After looking into the wait event details in the full text report at
http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt
(section "wait events for dilip-300-unlogged-async (runs combined)"),
I see that for pg-9.6-group-update the summary for 72, 108 and 144
clients (low - high - low) looks like this:
 clients | wait_event_type | wait_event      | wait_count | wait_pct
---------+-----------------+-----------------+------------+----------
      72 |                 |                 |     374845 |    62.87
      72 | Client          | ClientRead      |     136320 |    22.86
      72 | LWLockNamed     | CLogControlLock |      52804 |     8.86
      72 | LWLockTranche   | buffer_content  |      15337 |     2.57
      72 | LWLockNamed     | XidGenLock      |       7352 |     1.23
      72 | LWLockNamed     | ProcArrayLock   |       6630 |     1.11
     108 |                 |                 |     407179 |    46.01
     108 | LWLockNamed     | CLogControlLock |     300452 |    33.95
     108 | LWLockTranche   | buffer_content  |      87597 |     9.90
     108 | Client          | ClientRead      |      80901 |     9.14
     108 | LWLockNamed     | ProcArrayLock   |       3290 |     0.37
     144 |                 |                 |     623057 |    53.44
     144 | LWLockNamed     | CLogControlLock |     175072 |    15.02
     144 | Client          | ClientRead      |     163451 |    14.02
     144 | LWLockTranche   | buffer_content  |     147963 |    12.69
     144 | LWLockNamed     | XidGenLock      |      38361 |     3.29
     144 | Lock            | transactionid   |       8821 |     0.76
That is, there's a sudden jump in CLogControlLock waits from ~9% to ~34%
and then back to ~15% (and at 180 clients it jumps back up to ~35%).
That's pretty strange, and all the patches behave exactly the same.
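For reference, the wait_pct column above is simply each event's share of
all samples collected for a given client count; a throwaway Python
sketch of that aggregation (the input layout is an assumption, not the
actual format of the raw results):

# Sketch of how wait_pct is derived: each event's sample count divided by
# the total number of samples for that client count. The input layout
# (tab-separated "clients, wait_event_type, wait_event" rows) is assumed.
from collections import Counter, defaultdict

samples = defaultdict(Counter)              # clients -> Counter of (type, event)

with open("wait-samples.tsv") as f:         # hypothetical raw sample dump
    for line in f:
        clients, etype, event = line.rstrip("\n").split("\t")
        samples[int(clients)][(etype, event)] += 1

for clients in sorted(samples):
    total = sum(samples[clients].values())
    for (etype, event), n in samples[clients].most_common(6):
        print("%8d | %-13s | %-15s | %10d | %8.2f"
              % (clients, etype, event, n, 100.0 * n / total))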
scale 3000 (45GB), shared_buffers=16GB
---------------------------------------
For the small scale, the whole data set fits into the 16GB of shared
buffers, so there were pretty much no writes except for WAL and CLOG.
For scale 3000 that's no longer true - the backends will compete for
buffers and will constantly write dirty buffers out to the page cache.
I hadn't realized this initially, and the kernel was using the default
vm.dirty_* limits, i.e. 10% and 20%. As the machine has 3TB of RAM, this
resulted in rather excessive thresholds (or "insane" ones, if you want),
so the kernel regularly accumulated up to ~15GB of dirty data and then
wrote it out in a very short period of time. Even though the machine has
fairly powerful storage (4GB write cache on the controller, 10 x 12Gbps
SAS SSDs), this led to pretty bad latency spikes / drops in throughput.
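For what it's worth, this kind of dirty-data build-up is easy to watch
while a test runs; a minimal monitoring sketch (Linux-only, and the
one-second interval is arbitrary):

# Minimal sketch for watching the dirty-page build-up described above
# (Linux only): print the Dirty: value from /proc/meminfo once per second.
import time

def dirty_gb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1]) / (1024.0 * 1024.0)   # kB -> GB
    return 0.0

while True:                                  # stop with Ctrl+C
    print("%.2f GB dirty" % dirty_gb())
    time.sleep(1)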
I've only done two runs with this configuration before realizing what
was happening; the results are illustrated here:
* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes
I'm not sure how important those results are (if throughput and smooth
behavior matter, tuning the kernel thresholds is a must), but what I
find interesting is that while the patches manage to improve throughput
by 10-20%, they also (quite significantly) increase the variability of
the results (jitter in the tps over time). It's particularly visible in
the pgbench results. I'm not sure that's a good tradeoff.
After fixing the kernel page cache thresholds (by setting
vm.dirty_background_bytes to 256MB to get a smooth write-out), the
effect differs depending on the workload:
(a) dilip
http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync
- eliminates any impact of all the patches
(b) pgbench (-N)
http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip
- By far the most severe regression observed during the testing.
With 36 clients the throughput drops by ~40%, which I think is
pretty bad. Also the results are much more variable with the
patches (compared to master).
scale 3000 (45GB), shared_buffers=64GB
---------------------------------------
I've also done some tests with increased shared buffers, so that even
the large data set fits into them. Again, the results slightly depend on
the workload:
(a) dilip
* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
* http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64
Pretty much no impact on throughput or variability. Unlike on the
small data set, the patches don't even eliminate the performance
drop above 72 clients - the performance closely matches master.
(b) pgbench
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64
There's a small benefit (~20% at the same client count), and the
performance drop only happens after 72 clients. The patches also
significantly increase the variability of the results, particularly
at large client counts.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services