On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 5 September 2016 at 21:58, Claudio Freire <klaussfre...@gmail.com> wrote:
>
>>>>> How long does that part ever take? Is there any substantial gain from
>>>>> this?
>
>> Btw, without a further patch to prefetch pages on the backward scan
>> for truncate, however, my patience ran out before it finished
>> truncating. I haven't submitted that patch because there was an
>> identical patch in an older thread that was discussed and more or less
>> rejected since it slightly penalized SSDs.
>
> OK, that's enough context. Sorry for being forgetful on that point.
>
> Please post that new patch also.
Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 5 September 2016 at 15:50, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>>>> On 3 September 2016 at 04:25, Claudio Freire <klaussfre...@gmail.com> wrote:
>>>>> The patch also makes vacuum free the dead_tuples before starting
>>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>>> point, and it might help give the OS more cache, especially if work
>>>>> mem is configured very high to avoid multiple index scans.
>>>>
>>>> How long does that part ever take? Is there any substantial gain from this?
>>>>
>>>> Let's discuss that as a potential second patch.
>>>
>>> In the test case I mentioned, it takes longer than the vacuum part itself.
>>
>> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the information below applies only to the old patch (to be rewritten). I'm sending this merely to document the testing done; it will be a while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned in the OP:
>
> - createdb pgbench
> - pgbench -i -s 4000 pgbench
> - psql pgbench -c 'delete from pgbench_accounts;'
> - vacuumdb -v -t pgbench_accounts pgbench
>
> fsync=off, autovacuum=off, maintenance_work_mem=4GB
>
> From what I remember, it used ~2.7GB of RAM up until the truncate
> phase, where it freed it. It performed a single index scan over the
> PK.
>
> I don't remember timings, and I didn't take them, so I'll have to
> repeat the test to get them. It takes all day and makes my laptop
> unusably slow, so I'll post them later, but they're not very
> interesting. The only interesting bit is that it does a single index
> scan instead of several, which on TB-or-more tables is kinda nice.
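For convenience, the whole recipe as it would be run end to end (just a consolidated restatement of the steps and settings quoted above; apply the settings via postgresql.conf or however you prefer):

# non-default settings used for the run:
#   fsync = off
#   autovacuum = off
#   maintenance_work_mem = 4GB
createdb pgbench
pgbench -i -s 4000 pgbench
psql pgbench -c 'delete from pgbench_accounts;'
vacuumdb -v -t pgbench_accounts pgbench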
So, the test results below:

During setup, for context, the delete took 52m 50s of real time (measured with: time psql pgbench -c 'delete from pgbench_accounts;').

During the delete, my I/O looked on average like the following, which should give an indication of what my I/O subsystem is capable of (not much, granted):

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        0.47    5.27    35.53  77.42  17.58  42.95   1097.51    145.22  1295.23    33.47   1874.36   8.85 100.00

Since it's a 5.4k RPM laptop drive, it's rather slow on IOPS, and since I'm using the defaults for shared buffers and checkpoints, write throughput isn't stellar either. But that's not the point of the test anyway; it's just for context.

The hardware is an HP Envy laptop with a 1TB 5.4k RPM hard drive, 12GB of RAM, a Core i7-4722HQ, and no weird performance tweaking of any kind (i.e., CPU scaling left intact). The system was not dedicated, of course, being a laptop, but it had little else going on while the test was running. Given the size of the test, I don't believe there's any chance concurrent activity could invalidate the results.

The timing for setup was comparable with both versions (patched and unpatched), so I'm reporting the patched times only.

The vacuum phase, patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 400000000 row versions
DETAIL: CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO: "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL: CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL: 400000000 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 129.24s/127.24u sec elapsed 3877.13 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 34.88s/7.91u sec elapsed 1645.90 sec.

Total elapsed time: ~92 minutes

I/O during initial heap scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        1.52   99.78    72.63  62.47  31.94  33.22    987.80    146.71  1096.29    25.39   2341.48   7.40 100.00

Index scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        7.08    3.87    55.18  59.87  17.06  31.83    870.33    146.61  1243.34    31.42   2360.44   8.69 100.00

Final heap scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        0.78    8.65    65.32  57.32  31.50  32.96   1076.56    152.22  1928.67  1410.63   2519.01   8.15 100.00

Truncate (with prefetch):

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda      159.67    0.87  3720.03   2.82  30.31   0.12     16.74     19.11     5.13     4.95    242.60   0.27  99.23

Without prefetch, rMB/s during truncation varies between 4MB/s and 6MB/s, so it's on average 6 times slower, meaning truncation alone would take over 3 hours.

Peak memory used: 2369MB RSS, 4260MB VIRT (source: top)

Unpatched + prefetch (same config; effective work mem is 1GB due to the non-huge allocation limit):

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 5.88s/53.77u sec elapsed 263.63 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.28s/12.94u sec elapsed 757.45 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 7.44s/31.28u sec elapsed 282.41 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.24s/13.30u sec elapsed 806.54 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 42086526 row versions
DETAIL: CPU 4.30s/5.83u sec elapsed 170.30 sec.
INFO: "pgbench_accounts": removed 42086526 row versions in 689944 pages
DETAIL: CPU 3.35s/3.23u sec elapsed 126.22 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL: 400000000 index row versions were removed.
1096351 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.40 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 123.82s/183.76u sec elapsed 4071.54 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 40.36s/7.72u sec elapsed 1648.22 sec.
Total elapsed time: ~95 minutes

I/O during initial heap scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        1.48   32.53    66.10  60.50  31.95  34.88   1081.06    149.20  1175.78    25.44   2432.59   8.02 101.59

First index scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        1.17   14.95    43.85  70.07  19.65  40.18   1075.57    145.98  1278.39    31.86   2058.51   8.78 100.00

Final index scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda       17.12    1.50   169.85   2.28  68.33   0.67    820.95    158.32   312.00    28.14  21426.95   5.81 100.00

Truncation:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda      142.93    1.23  3444.70   4.65  28.16   0.36     16.93     18.52     5.37     5.25     91.17   0.29  99.22

Peak memory used: 1135MB RSS, 1188MB VIRT

Comparing:

Time reportedly spent scanning indexes: 716.34s unpatched, 566.47s patched
Time reportedly spent scanning the heap: 1690.21s unpatched, 1872.76s patched
Total vacuum scan as reported: 4071.54s unpatched, 3877.13s patched

Honestly, I didn't expect it to be such a close call. I believe the key reason is the speedup the unpatched run got during its final index scan: with far fewer tuples left to delete, it had much less to write. Clearly, having to interleave reads and writes is what stresses my HD, and the last index scan, writing less, was thus faster.

I don't believe this would have happened if the index hadn't been pristine and in almost physical (heap) order, so I'd expect real-world cases (with properly aged, shuffled and bloated indexes) to show a more pronounced difference, as would runs under a cost limit, which artificially caps the I/O rate vacuum can reach.

Clearly the patch is of use when I/O is the limiting factor, whether because of vacuum cost limits or because the I/O subsystem is the bottleneck, as it was in the test above.

Since more work mem means slower lookups into the dead_tuples array (more comparisons per lookup, plus poorer cache locality), I don't believe it will reduce the runtime of CPU-bound cases. It should at least generate less WAL, though, since that's another benefit of scanning the indexes fewer times (increased WAL rates during vacuum are another problem we regularly face in our production setup). Given that the I/O subsystem on my test machine can't produce a CPU-bound test case with enough dead tuples to stress the patch, I can't confirm that statement, but it should be evident from the implementation.
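For reference, the dead_tuples lookup in question is a binary search over the sorted TID array vacuum accumulates, done once per index tuple during the index scans. A simplified sketch of the existing lazy_tid_reaped()/vac_cmp_itemptr() pair in vacuumlazy.c (written from memory, so illustrative rather than verbatim):

/* Compare two TIDs by block number, then offset, i.e. heap order. */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
	BlockNumber lblk, rblk;
	OffsetNumber loff, roff;

	lblk = ItemPointerGetBlockNumber((ItemPointer) left);
	rblk = ItemPointerGetBlockNumber((ItemPointer) right);

	if (lblk < rblk)
		return -1;
	if (lblk > rblk)
		return 1;

	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
	roff = ItemPointerGetOffsetNumber((ItemPointer) right);

	if (loff < roff)
		return -1;
	if (loff > roff)
		return 1;
	return 0;
}

/*
 * Called once per index tuple: O(log n) in the number of dead TIDs,
 * where n grows with maintenance_work_mem. With a multi-GB array,
 * most of the probed cache lines will be cold.
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
	LVRelStats *vacrelstats = (LVRelStats *) state;

	return bsearch((void *) itemptr,
				   (void *) vacrelstats->dead_tuples,
				   vacrelstats->num_dead_tuples,
				   sizeof(ItemPointerData),
				   vac_cmp_itemptr) != NULL;
}

The single-scan approach makes n as large as work mem allows, which is exactly what makes each of these lookups slower.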
From 7c7efe7f8451a23b62d1c7285b44a34b533b370f Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 2/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during the
backward scan phase just prior to truncation. This helps rotating
media, where reading backwards one buffer at a time incurs a
buffer-to-buffer latency of roughly a whole disk rotation; such disks
are usually optimized for forward scans. Prefetching buffers in larger
chunks ameliorates the issue considerably and is otherwise harmless.
The prefetch amount should also be a non-issue on SSDs, which won't
benefit from the optimization but won't be hurt by it either.
---
 src/backend/commands/vacuumlazy.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index dbe2040..fd18aae 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1828,7 +1828,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1836,6 +1836,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1885,6 +1887,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+				prefetchBlkno -= 32;
+			else
+				prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+				PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vac_strategy);
 
-- 
2.6.6
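To see the effect in isolation from Postgres, below is a small standalone sketch of the same chunked backward-read pattern (my own demo code, not from the patch; the file path, block size, and error handling are arbitrary). On Linux builds PrefetchBuffer() essentially comes down to posix_fadvise(POSIX_FADV_WILLNEED), so this is roughly what the kernel sees:

#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLCKSZ 8192
#define PREFETCH_CHUNK 32		/* same chunk size the patch uses */

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "testfile";	/* arbitrary demo file */
	char		buf[BLCKSZ];
	struct stat st;
	off_t		blkno, nblocks, hinted_down_to;
	int			fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
	{
		perror(path);
		return 1;
	}

	nblocks = st.st_size / BLCKSZ;
	hinted_down_to = nblocks;	/* lowest block already hinted */

	/* Read the file backwards, one 8KB block at a time. */
	for (blkno = nblocks; blkno-- > 0;)
	{
		/*
		 * On reaching un-hinted territory, hint the next chunk below in
		 * one batch; the kernel can then fetch those blocks in ascending
		 * order instead of paying a rotation per backward step.
		 */
		if (blkno < hinted_down_to)
		{
			off_t		start = (blkno >= PREFETCH_CHUNK) ? blkno - PREFETCH_CHUNK + 1 : 0;

			posix_fadvise(fd, start * BLCKSZ,
						  (blkno - start + 1) * BLCKSZ,
						  POSIX_FADV_WILLNEED);
			hinted_down_to = start;
		}

		if (pread(fd, buf, BLCKSZ, blkno * BLCKSZ) < 0)
		{
			perror("pread");
			return 1;
		}
	}

	close(fd);
	return 0;
}

Timing this against the same loop without the posix_fadvise call (dropping the page cache between runs) should show the same flavour of difference on rotating media as the truncation iostat numbers above, and next to none on an SSD.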