On 10/20/2016 09:36 AM, Dilip Kumar wrote:
On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
I agree with these conclusions.  I had a chance to talk with Andres
this morning at Postgres Vision and based on that conversation I'd
like to suggest a couple of additional tests:

1. Repeat this test on x86.  In particular, I think you should test on
the EnterpriseDB server cthulhu, which is an 8-socket x86 server.

I have run my tests on cthulhu. The basic difference is that on POWER we
saw ClogControlLock at the top at 96 clients and above with scale factor
300, whereas on cthulhu at scale factor 300 the transactionid lock is
always at the top. So I repeated the test with scale factor 1000 on
cthulhu as well.

All other configuration is the same as in my last test.
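
(For reference, the wait-event counts below were presumably collected by sampling pg_stat_activity during each run and aggregating the samples. A rough sketch of such a collection loop follows - the file name, interval and duration are placeholders, not the exact script used.)

    for i in $(seq 1 300); do
        psql -t -c "SELECT wait_event_type, wait_event FROM pg_stat_activity" \
            >> 1000_192_ul.txt
        sleep 1
    done

    # aggregate: count how many times each wait event was observed
    sort 1000_192_ul.txt | uniq -c | sort -rn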

Test with 1000 scale factor
-------------------------------------

Test1: number of clients: 192

Head:
tps = 21206.108856 (including connections establishing)
tps = 21206.245441 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
 310489  LWLockNamed     | CLogControlLock
 296152                  |
  35537  Lock            | transactionid
  15821  LWLockTranche   | buffer_mapping
  10342  LWLockTranche   | buffer_content
   8427  LWLockTranche   | clog
   3961
   3165  Lock            | extend
   2861  Lock            | tuple
   2781  LWLockNamed     | ProcArrayLock
   1104  LWLockNamed     | XidGenLock
    745  LWLockTranche   | lock_manager
    371  LWLockNamed     | CheckpointerCommLock
     70  LWLockTranche   | wal_insert
      5  BufferPin       | BufferPin
      3  LWLockTranche   | proc

Patch:
tps = 28725.038933 (including connections establishing)
tps = 28725.367102 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_192_ul.txt
 540061                  |
  57810  LWLockNamed     | CLogControlLock
  36264  LWLockTranche   | buffer_mapping
  29976  Lock            | transactionid
   4770  Lock            | extend
   4735  LWLockTranche   | clog
   4479  LWLockNamed     | ProcArrayLock
   4006
   3955  LWLockTranche   | buffer_content
   2505  LWLockTranche   | lock_manager
   2179  Lock            | tuple
   1977  LWLockNamed     | XidGenLock
    905  LWLockNamed     | CheckpointerCommLock
    222  LWLockTranche   | wal_insert
      8  LWLockTranche   | proc

Test2: number of clients: 96

Head:
tps = 25447.861572 (including connections establishing)
tps = 25448.012739 (excluding connections establishing)
 261611                  |
  69604  LWLockNamed     | CLogControlLock
   6119  Lock            | transactionid
   4008
   2874  LWLockTranche   | buffer_mapping
   2578  LWLockTranche   | buffer_content
   2355  LWLockNamed     | ProcArrayLock
   1245  Lock            | extend
   1168  LWLockTranche   | clog
    232  Lock            | tuple
    217  LWLockNamed     | CheckpointerCommLock
    160  LWLockNamed     | XidGenLock
    158  LWLockTranche   | lock_manager
     78  LWLockTranche   | wal_insert
      5  BufferPin       | BufferPin

Patch:
tps = 32708.368938 (including connections establishing)
tps = 32708.765989 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_96_ul.txt
 326601                  |
   7471  LWLockNamed     | CLogControlLock
   5387  Lock            | transactionid
   4018
   3331  LWLockTranche   | buffer_mapping
   3144  LWLockNamed     | ProcArrayLock
   1372  Lock            | extend
    722  LWLockTranche   | buffer_content
    393  LWLockNamed     | XidGenLock
    237  LWLockTranche   | lock_manager
    234  Lock            | tuple
    194  LWLockTranche   | clog
     96  Lock            | relation
     88  LWLockTranche   | wal_insert
     34  LWLockNamed     | CheckpointerCommLock

Test3: number of clients: 64

Head:

tps = 28264.194438 (including connections establishing)
tps = 28264.336270 (excluding connections establishing)

 218264                  |
  10314  LWLockNamed     | CLogControlLock
   4019
   2067  Lock            | transactionid
   1950  LWLockTranche   | buffer_mapping
   1879  LWLockNamed     | ProcArrayLock
    592  Lock            | extend
    565  LWLockTranche   | buffer_content
    222  LWLockNamed     | XidGenLock
    143  LWLockTranche   | clog
    131  LWLockNamed     | CheckpointerCommLock
     63  LWLockTranche   | lock_manager
     52  Lock            | tuple
     35  LWLockTranche   | wal_insert

Patch:
tps = 27906.376194 (including connections establishing)
tps = 27906.531392 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_64_ul.txt
 228108                  |
   4039
   2294  Lock            | transactionid
   2116  LWLockTranche   | buffer_mapping
   1757  LWLockNamed     | ProcArrayLock
   1553  LWLockNamed     | CLogControlLock
    800  Lock            | extend
    403  LWLockTranche   | buffer_content
     92  LWLockNamed     | XidGenLock
     74  LWLockTranche   | lock_manager
     42  Lock            | tuple
     35  LWLockTranche   | wal_insert
     34  LWLockTranche   | clog
     14  LWLockNamed     | CheckpointerCommLock

Test4: number of clients: 32

Head:
tps = 27587.999912 (including connections establishing)
tps = 27588.119611 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 117762                  |
   4031
    614  LWLockNamed     | ProcArrayLock
    379  LWLockNamed     | CLogControlLock
    344  Lock            | transactionid
    183  Lock            | extend
    102  LWLockTranche   | buffer_mapping
     71  LWLockTranche   | buffer_content
     39  LWLockNamed     | XidGenLock
     25  LWLockTranche   | lock_manager
      3  LWLockTranche   | wal_insert
      3  LWLockTranche   | clog
      2  LWLockNamed     | CheckpointerCommLock
      2  Lock            | tuple

Patch:
tps = 28291.428848 (including connections establishing)
tps = 28291.586435 (excluding connections establishing)
[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 116596                  |
   4041
    757  LWLockNamed     | ProcArrayLock
    407  LWLockNamed     | CLogControlLock
    358  Lock            | transactionid
    183  Lock            | extend
    142  LWLockTranche   | buffer_mapping
     77  LWLockTranche   | buffer_content
     68  LWLockNamed     | XidGenLock
     35  LWLockTranche   | lock_manager
     15  LWLockTranche   | wal_insert
      7  LWLockTranche   | clog
      7  Lock            | tuple
      4  LWLockNamed     | CheckpointerCommLock

Summary:
- At 96 clients and above, ClogControlLock is at the top.
- With the patch, contention on ClogControlLock is reduced significantly.
I think this behaviour is the same as what we saw on POWER.

With scale factor 300:
- Contention on ClogControlLock is significant only at 192 clients
(the transactionid lock is still on top), and it is completely removed
with the group lock patch.

For scale factor 300, I am posting data only for the 192-client count
(if anyone is interested in other data, I can post it).


In the results you've posted on 10/12, you've mentioned a regression with 32 clients, where you got 52k tps on master but only 48k tps with the patch (so ~10% difference). I have no idea what scale was used for those tests, and I see no such regression in the current results (but you only report results for some of the client counts).

Also, which of the proposed patches have you been testing?

Can you collect and share a more complete set of data, perhaps based on the scripts I use to do tests on the large machine with 36/72 cores, available at https://bitbucket.org/tvondra/hp05-results ?

I've taken some time to build simple web-based reports from the results collected so far (also included in the git repository), and pushed them here:

    http://tvondra.bitbucket.org

For each of the completed runs, there's a report comparing tps for different client counts with master and the three patches (average tps, median and stddev), and it's possible to download a more thorough text report with wait event stats, comparison of individual runs etc.

If you want to cooperate on this, I'm available - i.e. I can help you get the tooling running, customize it etc.


Regarding the results collected on the "big machine" so far, I do have a few observations:

pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------
* in general, those results seem fine

* the results generally fall into 3 categories (I'll show results for "pgbench -N" but regular pgbench behaves similarly):

(a) logged, sync_commit=on - no impact
    http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip

(b) logged, sync_commit=off - improvement
    http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip

    The throughput improves by ~20% with 72 clients, and then it
    levels off (but does not drop, unlike on master). With high client
    counts the difference is up to 300%, but people who care about
    throughput won't run with such client counts anyway.

    Not only does this improve throughput, it also significantly
    reduces the variability of the performance (i.e. measure throughput
    each second and compute the stddev of that; see the sketch after
    this list). You can imagine this as a much "smoother" chart of tps
    over time.

(c) unlogged, sync_commit=* - improvement
    http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

    This is actually quite similar to (b).


dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------

* those results seem less OK

* I haven't found any significant regressions (in the sense of a significant performance drop compared to master), but the behavior in some cases seems fairly strange (and it's repeatable)

* consider for example these results:

  http://tvondra.bitbucket.org/#dilip-300-unlogged-async
  http://tvondra.bitbucket.org/#dilip-300-logged-async

* the saw-like pattern is rather suspicious, and I don't think I've seen anything like that before - I guess there's some feedback loop, and we'd better find it before committing any of the patches, because this is something I don't want to see on any production machine (and I bet neither do you)

* After looking into wait event details in the full text report at

  http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt

  (section "wait events for dilip-300-unlogged-async (runs combined)")

  I see that for pg-9.6-group-update, the statistics for 72, 108 and
  144 clients (low - high - low) look like this:

   clients | wait_event_type |     wait_event  | wait_count | wait_pct
  ---------+-----------------+-----------------+------------+----------
        72 |                 |                 |     374845 |    62.87
        72 | Client          | ClientRead      |     136320 |    22.86
        72 | LWLockNamed     | CLogControlLock |      52804 |     8.86
        72 | LWLockTranche   | buffer_content  |      15337 |     2.57
        72 | LWLockNamed     | XidGenLock      |       7352 |     1.23
        72 | LWLockNamed     | ProcArrayLock   |       6630 |     1.11

       108 |                 |                 |     407179 |    46.01
       108 | LWLockNamed     | CLogControlLock |     300452 |    33.95
       108 | LWLockTranche   | buffer_content  |      87597 |     9.90
       108 | Client          | ClientRead      |      80901 |     9.14
       108 | LWLockNamed     | ProcArrayLock   |       3290 |     0.37

       144 |                 |                 |     623057 |    53.44
       144 | LWLockNamed     | CLogControlLock |     175072 |    15.02
       144 | Client          | ClientRead      |     163451 |    14.02
       144 | LWLockTranche   | buffer_content  |     147963 |    12.69
       144 | LWLockNamed     | XidGenLock      |      38361 |     3.29
       144 | Lock            | transactionid   |       8821 |     0.76

  That is, there's a sudden jump on CLogControlLock from ~9% to ~34% and
  then back to ~15% (and for 180 clients it jumps back to ~35%). That's
  pretty strange, and all the patches behave exactly the same.
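
  (For reference, a percentage summary like the one above can be produced
  with a query along these lines, assuming the collected samples are
  loaded into a table first - the table and column names here are
  illustrative, not the actual schema used by the scripts.)

      psql bench -c "
      SELECT clients, wait_event_type, wait_event,
             count(*) AS wait_count,
             round(100.0 * count(*) /
                   sum(count(*)) OVER (PARTITION BY clients), 2) AS wait_pct
      FROM wait_samples
      GROUP BY clients, wait_event_type, wait_event
      ORDER BY clients, wait_count DESC;"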


scale 3000 (45GB), shared_buffers=16GB
---------------------------------------

For the small scale, the whole data set fits into the 16GB of shared buffers, so there were pretty much no writes except for WAL and CLOG. For scale 3000 that's no longer true - the backends will compete for buffers and will constantly write dirty buffers out to the page cache.

I hadn't realized this initially, and the kernel was using the default vm.dirty_* limits (i.e. 10% and 20%). As the machine has 3TB of RAM, this resulted in rather excessive thresholds (or "insane" if you prefer), so the kernel regularly accumulated up to ~15GB of dirty data and then wrote it out in a very short period of time. Even though the machine has fairly powerful storage (4GB write cache on the controller, 10 x 12Gbps SAS SSDs), this led to pretty bad latency spikes / drops in throughput.

I only did two runs with this configuration before realizing what was happening; the results are illustrated here:

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes

I'm not sure how important those results are (if throughput and smooth behavior matter, tuning the kernel thresholds is a must), but what I find interesting is that while the patches manage to improve throughput by 10-20%, they also (quite significantly) increase the variability of the results (jitter in the tps over time). It's particularly visible in the pgbench results. I'm not sure that's a good tradeoff.
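
(A minimal sketch of the threshold change applied for the subsequent runs - only the 256MB background limit is stated in the text; the vm.dirty_bytes value here is just an illustrative choice.)

    # switch from percentage-based to byte-based dirty-page limits, so the
    # kernel starts background write-out much earlier on a 3TB machine
    sysctl -w vm.dirty_background_bytes=$((256 * 1024 * 1024))   # 256MB
    sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024))             # 1GB, illustrative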

After fixing the kernel page cache thresholds (by setting vm.dirty_background_bytes to 256MB to get smooth write-out), the effect differs depending on the workload:

(a) dilip
    http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync

    - this eliminates any impact of the patches

(b) pgbench (-N)
    http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

    - By far the most severe regression observed during the testing.
    With 36 clients the throughput drops by ~40%, which I think is
    pretty bad. Also the results are much more variable with the
    patches (compared to master).


scale 3000 (45GB), shared_buffers=64GB
---------------------------------------

I've also done some tests with increased shared buffers, so that even the large data set fits into them. Again, the results slightly depend on the workload:

(a) dilip

  * http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
  * http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64

  Pretty much no impact on throughput or variability. Unlike on the
  small data set, the patches don't even eliminate the performance
  drop above 72 clients - the performance closely matches master.

(b) pgbench

  * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
  * http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64

  There's a small benefit (~20% on the same client count), and the
  performance drop only happens after 72 clients. The patches also
  significantly increase variability of the results, particularly for
  large client counts.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

