On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> [patch]

Hi Thomas / hackers,

I just wanted to help test this patch (defer SLRU fsyncs during recovery) and 
also the faster compactify_tuples() patch [2], as both are related to WAL 
recovery performance, which I'm interested in. This is my first message to this 
mailing list, so please also let me know if I should adjust the testing style 
or formatting.

With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.
Review of the patches themselves: I'm not qualified to review the PostgreSQL 
internals.
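
For reference, the build-and-test sequence was roughly the following (a sketch 
only; the file name of the compactify_tuples() patch below is a placeholder, 
the real one comes from the commitfest entry [2]):

    git checkout a3c66de6c5e1ee9dd41ce1454496568622fb7712
    git am v2-0001-Defer-flushing-of-SLRU-files.patch
    git am faster-compactify_tuples.patch   # placeholder file name, patch from [2]
    ./configure && make -j16                # configure defaults to CFLAGS="-O2" here
    make check && make check-world
    make standbycheck                       # per the "Testing Hot Standby" docs section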

I've used the redo-bench scripts [1] by Thomas to measure the performance 
effect (this approach simplifies testing and excludes network jitter). In the 
results below, the 1st column is the redo start -> end time, and the 2nd is the 
time from the end of redo to the end of the checkpoint taken before the DB is 
opened for reads; the measurement workflow is sketched right after this list. 
I've conducted 2-3 separate tests that show the benefits of those patches 
depending on the workload:
- Handing SLRU sync work over to the checkpointer: in my testing it accelerates 
WAL recovery performance on slower / higher latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery 
performance for HOT updates also by ~20%
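
For context, the measurement workflow behind the numbers below is roughly the 
one sketched here (an illustration only; the actual steps live in the 
redo-bench scripts [1]):

    # 1. generate WAL with the given workload (see each TEST), then simulate a crash:
    pg_ctl -D $PGDATA stop -m immediate
    # 2. keep a pristine copy of the crashed cluster so the same WAL can be replayed repeatedly:
    cp -a $PGDATA $PGDATA.bak
    # 3. for each run: restore the copy, start postgres, and time
    #    redo start -> redo done                             (1st column)
    #    redo done  -> end of the end-of-recovery checkpoint (2nd column)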

TEST1: a workload-profile test as per the standard TPC-B-like pgbench run 
(-c8 -j8), with the majority of records in the WAL stream being 
Heap/HOT_UPDATE; a sketch of the workload commands follows, then the results.
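
The workload itself was just the stock TPC-B-like pgbench script; a minimal 
sketch (the scale factor and database name are placeholders, and the 
transaction count follows the "same amount of transactions" note in TEST2):

    pgbench -i -s 100 bench              # placeholder scale factor and DB name
    pgbench -c8 -j8 -t 1000000 bench     # built-in TPC-B-like script; the updated balance
                                         # columns are not indexed, so most updates are HOT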

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81

xvda, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
69.686, 0.101
70.601, 0.102
70.042, 0.101

As Thomas stated, with the standard pgbench workload profile there is visible 
compactify_tuples()->pg_qsort() overhead on the recovery side. So this is where 
the 2nd patch helps:

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the 
compactify_tuples()->pg_qsort() performance improvement
58.85, 0.296
58.605, 0.269
58.603, 0.277

NVMe, same master with v2-0001-Defer-flushing-of-SLRU-files.patch and the 
compactify_tuples()->pg_qsort() performance improvement, fsync=off
58.618, 0.103
57.693, 0.101
58.779, 0.111

In this last case the top of the profile is still related to the sorting, but 
as I understand it, in a much more optimal way:

    26.68%  postgres  postgres            [.] qsort_itemoff
            ---qsort_itemoff
               |--14.17%--qsort_itemoff
               |          |--10.96%--compactify_tuples
               |          |          PageRepairFragmentation
               |          |          heap2_redo
               |          |          StartupXLOG
               |           --3.21%--qsort_itemoff
               |                      --3.10%--compactify_tuples
               |                                PageRepairFragmentation
               |                                heap2_redo
               |                                StartupXLOG
                --12.51%--compactify_tuples
                          PageRepairFragmentation
                          heap2_redo
                          StartupXLOG

     8.38%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               compactify_tuples
               PageRepairFragmentation
               heap2_redo
               StartupXLOG

     6.51%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--3.62%--smgropen
               |          |--2.17%--XLogReadBufferExtended
               |          |           --1.76%--XLogReadBufferForRedoExtended
               |          |                      --0.93%--heap_xlog_update
               |           --1.45%--ReadBufferWithoutRelcache
               |                     XLogReadBufferExtended
               |                      --1.34%--XLogReadBufferForRedoExtended
               |                                 --0.72%--heap_xlog_update
                --2.69%--BufTableLookup
                          ReadBuffer_common
                          ReadBufferWithoutRelcache
                          XLogReadBufferExtended
                           --2.48%--XLogReadBufferForRedoExtended
                                     |--1.34%--heap2_redo
                                     |          StartupXLOG
                                      --0.83%--heap_xlog_update


To sum up, a HOT-update-like workload profile tends to be CPU bound on the 
single-process recovery side. Even the slow storage (xvda) was not the 
bottleneck here, as everything was already hot in the VFS cache.
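
For reference, the profiles in this mail were captured roughly like this (a 
sketch; the way of finding the startup process PID and the 30s window are 
placeholders):

    perf record -g -p $(pgrep -f 'startup recovering') -- sleep 30
    perf report --stdio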

TEST2: The second suite of tests used an append-only workload profile (the same 
number of transactions, using pgbench -c8 -j8 -f insert_only.sql -n -t 
1000000), but with a simpler schema:
CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 
integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp";

and a customized script for pgbench (insert_only.sql), containing just this:
BEGIN;
INSERT INTO t (c1,c2,c3,c4) VALUES (uuid_generate_v4(), uuid_generate_v4(), 
round(5000*random()), 500);
END;

The majority of WAL records were Btree/INSERT_LEAF (60%); the results are as follows:

xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
99.061, 7.071
99.472, 6.386
98.036, 5.994

xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
102.838, 0.136
102.711, 0.099
103.093, 0.0970001

NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
94.124, 0.387
96.61, 0.416
94.624, 0.451

NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001

So apparently the v2-0001-Defer-flushing-of-SLRU-files patch helps in my case 
on the higher-latency storage.

The append-only workload appears to be limited by syscalls/s due to the small 
block size, even with everything in the FS cache (but not in shared buffers; 
compare with TEST1, where there was no such bottleneck at all):

    29.62%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
            ---copy_user_enhanced_fast_string
               |--17.98%--copyin
[..]
               |          __pwrite_nocancel
               |          FileWrite
               |          mdwrite
               |          FlushBuffer
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |           --17.57%--btree_xlog_insert
               |                     btree_redo
               |                     StartupXLOG
               |
                --11.64%--copyout
[..]
                          __pread_nocancel
                           --11.44%--FileRead
                                     mdread
                                     ReadBuffer_common
                                     ReadBufferWithoutRelcache
                                     XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --11.34%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     5.62%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--1.74%--smgropen
[..]

The number of syscalls/s topped out at ~100k/s at 8kB each, for reads and 
writes respectively (it's logical I/O, as everything fits in the VFS cache; 
there was nearly no physical I/O). It was also visible during the test that the 
startup/recovering process spent ~60% of its time in %sys under these 
conditions. As there was no sorting visible in the profiler, I did not test 
this workload with the compactify_tuples()->pg_qsort() improvement, although 
from basic runs it appears it did not introduce any degradation.
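
For anyone wanting to reproduce the syscall-rate observation, a minimal sketch 
of how it can be measured (again, the PID lookup is just a placeholder, and the 
tracepoint names are the x86-64 Linux ones):

    # per-second CPU split (%usr vs %sys) of the startup/recovering process:
    pidstat -u -p $(pgrep -f 'startup recovering') 1
    # count the 8kB reads/writes over a 10s window:
    perf stat -e 'syscalls:sys_enter_pread64,syscalls:sys_enter_pwrite64' \
        -p $(pgrep -f 'startup recovering') -- sleep 10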

TEST2b: Quite frankly, at first glance I did not understand why 
btree_xlog_insert()->ReadBuffer_common() would end up in FlushBuffer() and 
write, until I bumped shared_buffers from 128MB to 24GB: with the tiny setting 
it was apparently evicting dirty buffers, which caused those pwrite()s. It's 
not evident as a direct call from ReadBuffer*() in the code (presumably the 
buffer-allocation path that flushes a dirty victim buffer gets inlined there; 
any idea how to monitor the cycling of dirty buffers on a standby during 
recovery, when there is no bgwriter yet? One rough idea is sketched after the 
numbers below). Still, I wanted to eliminate storage and the VFS cache as the 
bottleneck. This led me to results like the ones below (fsync=on, defer patch 
applied, shared_buffers=24GB to eliminate VFS cache lookups):
27.341, 0.858
27.26, 0.869
26.959, 0.86
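
As for the monitoring question above, one rough idea (an assumption on my side, 
not something I have verified in this setup) is that on a hot standby that 
already accepts read-only connections, pg_buffercache - installed on the 
primary, since the standby is read-only - could show the dirty-buffer churn 
while redo is running:

    psql -c "SELECT isdirty, count(*) FROM pg_buffercache GROUP BY isdirty;"
    # repeated in a loop, this gives a crude view of how many buffers are dirty during replay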

Turning the defer-SLRU patch and/or fsync on or off doesn't seem to make any 
difference here, so in case anyone is curious, the next set of append-only 
bottlenecks looks like this:

    14.69%  postgres  postgres            [.] hash_search_with_hash_value
            ---hash_search_with_hash_value
               |--9.80%--BufTableLookup
               |          ReadBuffer_common
               |          ReadBufferWithoutRelcache
               |          XLogReadBufferExtended
               |          XLogReadBufferForRedoExtended
               |          |--7.76%--btree_xlog_insert
               |          |          btree_redo
               |          |          StartupXLOG
               |           --1.63%--heap_xlog_insert
                --4.90%--smgropen
                          |--2.86%--ReadBufferWithoutRelcache
                          |          XLogReadBufferExtended
                          |          |--2.04%--XLogReadBufferForRedoExtended
                          |          |          |--1.22%--btree_xlog_insert
                          |          |          |          btree_redo
                          |          |          |          StartupXLOG
                          |          |           --0.82%--heap_xlog_insert
                          |           --0.82%--XLogRecordPageWithFreeSpace
                          |                     heap_xlog_insert
                           --2.04%--XLogReadBufferExtended
                                     XLogReadBufferForRedoExtended
                                      --1.63%--btree_xlog_insert
                                                btree_redo
                                                StartupXLOG

     7.76%  postgres  libc-2.17.so        [.] __memmove_ssse3_back
            ---__memmove_ssse3_back
               PageAddItemExtended
               btree_xlog_insert
               btree_redo
               StartupXLOG

Still, the result seems nice, as it is 10^6 transactions / 27s = ~34k TPS on 
the recovery side (~3000MB of WAL / ~27s = ~111MB/s, without FPWs), assuming no 
I/O overheads. For the full picture, compare that with the rate at which the 
primary could generate the same data with INSERTs, without any special tuning:
- ~29k TPS of INSERT-only transactions/COMMITs with pgbench -c8
- ~49k TPS of INSERT-only transactions/COMMITs with pgbench -c16
So the (single-process) WAL recovery code in 14master seems to reach roughly 
70% of the performance of a fairly low-end 16vCPU primary in the above 
append-only conditions.

Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered,noatime), 1s8c16t Xeon 
CPU E5-2686 v4 @ 2.30GHz, 128GB RAM, gcc 7.2.1, CFLAGS set by ./configure to 
"-O2", tests run with the default/tiny shared_buffers until the last test.

xvda AKA slow storage: root file system, single thread tests:
        dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==>  157 MB/s
        dd if=/dev/zero of=test bs=8k count=10000 oflag=direct =>  17.2 MB/s

NVMe:  striped VG consisting of 2x NVMe devices with much lower latency
        dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==>  1.9 GB/s or 
maybe even more
        dd if=/dev/zero of=test bs=8k count=10000 oflag=direct =>  141 MB/s

-Jakub Wartak.

[1] - https://github.com/macdice/redo-bench/
[2] - https://commitfest.postgresql.org/29/2687/

