On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 5 September 2016 at 21:58, Claudio Freire <klaussfre...@gmail.com> wrote:
>
>>>>> How long does that part ever take? Is there any substantial gain from
>>>>> this?
>
>> Btw, without a further patch to prefetch pages on the backward scan
>> for truncate, however, my patience ran out before it finished
>> truncating. I haven't submitted that patch because there was an
>> identical patch in an older thread that was discussed and more or less
>> rejected since it slightly penalized SSDs.
>
> OK, that's enough context. Sorry for being forgetful on that point.
>
> Please post that new patch also.
Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
>> On 5 September 2016 at 15:50, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
>>>> On 3 September 2016 at 04:25, Claudio Freire <klaussfre...@gmail.com> wrote:
>>>>> The patch also makes vacuum free the dead_tuples before starting
>>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>>> point, and it might help give the OS more cache, especially if work
>>>>> mem is configured very high to avoid multiple index scans.
>>>>
>>>> How long does that part ever take? Is there any substantial gain from this?
>>>>
>>>> Let's discuss that as a potential second patch.
>>>
>>> In the test case I mentioned, it takes longer than the vacuum part itself.
>>
>> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the information below applies only to the old patch (to be rewritten). I'm sending this merely to document the testing done; it will be a while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned in the OP:
>
> - createdb pgbench
> - pgbench -i -s 4000 pgbench
> - psql pgbench -c 'delete from pgbench_accounts;'
> - vacuumdb -v -t pgbench_accounts pgbench
>
> fsync=off, autovacuum=off, maintenance_work_mem=4GB
>
> From what I remember, it used ~2.7GB of RAM up until the truncate
> phase, where it freed it. It performed a single index scan over the
> PK.
>
> I don't remember timings, and I didn't take them, so I'll have to
> repeat the test to get them. It takes all day and makes my laptop
> unusably slow, so I'll post them later, but they're not very
> interesting. The only interesting bit is that it does a single index
> scan instead of several, which on TB-or-more tables is kinda nice.
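For convenience, the whole recipe as it would be run end to end (just a consolidated restatement of the steps and settings quoted above; apply the settings via postgresql.conf or however you prefer):

# non-default settings used for the run:
#   fsync = off
#   autovacuum = off
#   maintenance_work_mem = 4GB
createdb pgbench
pgbench -i -s 4000 pgbench
psql pgbench -c 'delete from pgbench_accounts;'
vacuumdb -v -t pgbench_accounts pgbench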
So, the test results below:

During setup, for context, the delete took 52m 50s of real time (measured with: time psql pgbench -c 'delete from pgbench_accounts;').

During the delete, my I/O looked on average like the following, which should give an indication of what my I/O subsystem is capable of (not much, granted):

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        0.47    5.27    35.53  77.42  17.58  42.95   1097.51    145.22  1295.23    33.47   1874.36   8.85 100.00

Since it's a 5.4k RPM laptop drive, it's rather slow on IOPS, and since I'm using the defaults for shared buffers and checkpoints, write throughput isn't stellar either. But that's not the point of the test anyway; it's just for context.

The hardware is an HP Envy laptop with a 1TB 5.4k RPM hard drive, 12GB of RAM, a Core i7-4722HQ, and no weird performance tweaking of any kind (i.e., CPU scaling left intact). The system was not dedicated, of course, being a laptop, but it had little else going on while the test was running. Given the size of the test, I don't believe there's any chance concurrent activity could invalidate the results.

The timing for setup was comparable with both versions (patched and unpatched), so I'm reporting the patched times only.

The vacuum phase, patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 400000000 row versions
DETAIL: CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO: "pgbench_accounts": removed 400000000 row versions in 6557378 pages
DETAIL: CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL: 400000000 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 129.24s/127.24u sec elapsed 3877.13 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 34.88s/7.91u sec elapsed 1645.90 sec.

Total elapsed time: ~92 minutes

I/O during initial heap scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        1.52   99.78    72.63  62.47  31.94  33.22    987.80    146.71  1096.29    25.39   2341.48   7.40 100.00

Index scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        7.08    3.87    55.18  59.87  17.06  31.83    870.33    146.61  1243.34    31.42   2360.44   8.69 100.00

Final heap scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        0.78    8.65    65.32  57.32  31.50  32.96   1076.56    152.22  1928.67  1410.63   2519.01   8.15 100.00

Truncate (with prefetch):

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda      159.67    0.87  3720.03   2.82  30.31   0.12     16.74     19.11     5.13     4.95    242.60   0.27  99.23

Without prefetch, rMB/s during truncation varies between 4MB/s and 6MB/s, so it's on average 6 times slower, meaning truncation alone would take over 3 hours.

Peak memory used: 2369MB RSS, 4260MB VIRT (source: top)

Unpatched + prefetch (same config; effective work mem is 1GB due to the non-huge allocation limit):

$ vacuumdb -v -t pgbench_accounts pgbench
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 5.88s/53.77u sec elapsed 263.63 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.28s/12.94u sec elapsed 757.45 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 178956737 row versions
DETAIL: CPU 7.44s/31.28u sec elapsed 282.41 sec.
INFO: "pgbench_accounts": removed 178956737 row versions in 2933717 pages
DETAIL: CPU 22.24s/13.30u sec elapsed 806.54 sec.
INFO: scanned index "pgbench_accounts_pkey" to remove 42086526 row versions
DETAIL: CPU 4.30s/5.83u sec elapsed 170.30 sec.
INFO: "pgbench_accounts": removed 42086526 row versions in 689944 pages
DETAIL: CPU 3.35s/3.23u sec elapsed 126.22 sec.
INFO: index "pgbench_accounts_pkey" now contains 0 row versions in 1096762 pages
DETAIL: 400000000 index row versions were removed.
1096351 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.40 sec.
INFO: "pgbench_accounts": found 400000000 removable, 0 nonremovable row versions in 6557378 out of 6557378 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 123.82s/183.76u sec elapsed 4071.54 sec.
INFO: "pgbench_accounts": truncated 6557378 to 0 pages
DETAIL: CPU 40.36s/7.72u sec elapsed 1648.22 sec.
Total elapsed time: ~95 minutes

I/O during initial heap scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        1.48   32.53    66.10  60.50  31.95  34.88   1081.06    149.20  1175.78    25.44   2432.59   8.02 101.59

First index scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda        1.17   14.95    43.85  70.07  19.65  40.18   1075.57    145.98  1278.39    31.86   2058.51   8.78 100.00

Final index scan:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda       17.12    1.50   169.85   2.28  68.33   0.67    820.95    158.32   312.00    28.14  21426.95   5.81 100.00

Truncation:

Device:  rrqm/s  wrqm/s      r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await   w_await  svctm  %util
sda      142.93    1.23  3444.70   4.65  28.16   0.36     16.93     18.52     5.37     5.25     91.17   0.29  99.22

Peak memory used: 1135MB RSS, 1188MB VIRT

Comparing:

Time reportedly spent scanning indexes: 716.34s unpatched, 566.47s patched
Time reportedly spent scanning the heap: 1690.21s unpatched, 1872.76s patched
Total vacuum scan as reported: 4071.54s unpatched, 3877.13s patched

Honestly, I didn't expect it to be such a close call. I believe the key reason is the speedup the unpatched run got during its final index scan: with far fewer tuples left to delete, it had much less to write. Clearly, having to interleave reads and writes is what stresses my HD, and the last index scan, writing less, was thus faster.

I don't believe this would have happened if the index hadn't been pristine and in almost physical (heap) order, so I'd expect real-world cases (with properly aged, shuffled and bloated indexes) to show a more pronounced difference, as would runs under a cost limit, which artificially caps the I/O rate vacuum can reach.

Clearly the patch is of use when I/O is the limiting factor, whether because of vacuum cost limits or because the I/O subsystem is the bottleneck, as it was in the test above.

Since more work mem means slower lookups into the dead_tuples array (more comparisons per lookup, plus poorer cache locality), I don't believe it will reduce the runtime of CPU-bound cases. It should at least generate less WAL, though, since that's another benefit of scanning the indexes fewer times (increased WAL rates during vacuum are another problem we regularly face in our production setup). Given that the I/O subsystem on my test machine can't produce a CPU-bound test case with enough dead tuples to stress the patch, I can't confirm that statement, but it should be evident from the implementation.
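For reference, the dead_tuples lookup in question is a binary search over the sorted TID array vacuum accumulates, done once per index tuple during the index scans. A simplified sketch of the existing lazy_tid_reaped()/vac_cmp_itemptr() pair in vacuumlazy.c (written from memory, so illustrative rather than verbatim):

/* Compare two TIDs by block number, then offset, i.e. heap order. */
static int
vac_cmp_itemptr(const void *left, const void *right)
{
	BlockNumber lblk, rblk;
	OffsetNumber loff, roff;

	lblk = ItemPointerGetBlockNumber((ItemPointer) left);
	rblk = ItemPointerGetBlockNumber((ItemPointer) right);

	if (lblk < rblk)
		return -1;
	if (lblk > rblk)
		return 1;

	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
	roff = ItemPointerGetOffsetNumber((ItemPointer) right);

	if (loff < roff)
		return -1;
	if (loff > roff)
		return 1;
	return 0;
}

/*
 * Called once per index tuple: O(log n) in the number of dead TIDs,
 * where n grows with maintenance_work_mem. With a multi-GB array,
 * most of the probed cache lines will be cold.
 */
static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
	LVRelStats *vacrelstats = (LVRelStats *) state;

	return bsearch((void *) itemptr,
				   (void *) vacrelstats->dead_tuples,
				   vacrelstats->num_dead_tuples,
				   sizeof(ItemPointerData),
				   vac_cmp_itemptr) != NULL;
}

The single-scan approach makes n as large as work mem allows, which is exactly what makes each of these lookups slower.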
From 7c7efe7f8451a23b62d1c7285b44a34b533b370f Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 2 Sep 2016 23:33:48 -0300
Subject: [PATCH 2/2] Vacuum: prefetch buffers on backward scan

This patch speeds up truncation by prefetching buffers during the
backward scan phase just prior to truncation. This helps rotating
media, where reading backwards one buffer at a time incurs a
buffer-to-buffer latency of roughly a whole disk rotation; such disks
are usually optimized for forward scans. Prefetching buffers in larger
chunks ameliorates the issue considerably and is otherwise harmless.
The prefetch amount should also be a non-issue on SSDs, which won't
benefit from the optimization but won't be hurt by it either.
---
 src/backend/commands/vacuumlazy.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index dbe2040..fd18aae 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1828,7 +1828,7 @@ lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats)
 static BlockNumber
 count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 {
-	BlockNumber blkno;
+	BlockNumber blkno, prefetchBlkno;
 	instr_time	starttime;
 
 	/* Initialize the starttime if we check for conflicting lock requests */
@@ -1836,6 +1836,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 	/* Strange coding of loop control is needed because blkno is unsigned */
 	blkno = vacrelstats->rel_pages;
+	prefetchBlkno = blkno & ~0x1f;
+	prefetchBlkno = (prefetchBlkno > 32) ? prefetchBlkno - 32 : 0;
 	while (blkno > vacrelstats->nonempty_pages)
 	{
 		Buffer		buf;
@@ -1885,6 +1887,23 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		blkno--;
 
+		/*
+		 * Speed up reads on rotating rust by prefetching a few blocks ahead.
+		 * Backward scans on rotary disks are slow if done one buffer at a time
+		 * because readahead won't kick in on most OSes, and each buffer will
+		 * have to wait for the platter to do a full rotation.
+		 * Should be harmless on SSD.
+		 */
+		if (blkno <= prefetchBlkno) {
+			BlockNumber pblkno;
+			if (prefetchBlkno >= 32)
+				prefetchBlkno -= 32;
+			else
+				prefetchBlkno = 0;
+			for (pblkno = prefetchBlkno; pblkno < blkno; pblkno++)
+				PrefetchBuffer(onerel, MAIN_FORKNUM, pblkno);
+		}
+
 		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
 								 RBM_NORMAL, vac_strategy);
 
-- 
2.6.6
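To see the effect in isolation from Postgres, below is a small standalone sketch of the same chunked backward-read pattern (my own demo code, not from the patch; the file path, block size, and error handling are arbitrary). On Linux builds PrefetchBuffer() essentially comes down to posix_fadvise(POSIX_FADV_WILLNEED), so this is roughly what the kernel sees:

#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

#define BLCKSZ 8192
#define PREFETCH_CHUNK 32		/* same chunk size the patch uses */

int
main(int argc, char **argv)
{
	const char *path = (argc > 1) ? argv[1] : "testfile";	/* arbitrary demo file */
	char		buf[BLCKSZ];
	struct stat st;
	off_t		blkno, nblocks, hinted_down_to;
	int			fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
	{
		perror(path);
		return 1;
	}

	nblocks = st.st_size / BLCKSZ;
	hinted_down_to = nblocks;	/* lowest block already hinted */

	/* Read the file backwards, one 8KB block at a time. */
	for (blkno = nblocks; blkno-- > 0;)
	{
		/*
		 * On reaching un-hinted territory, hint the next chunk below in
		 * one batch; the kernel can then fetch those blocks in ascending
		 * order instead of paying a rotation per backward step.
		 */
		if (blkno < hinted_down_to)
		{
			off_t		start = (blkno >= PREFETCH_CHUNK) ? blkno - PREFETCH_CHUNK + 1 : 0;

			posix_fadvise(fd, start * BLCKSZ,
						  (blkno - start + 1) * BLCKSZ,
						  POSIX_FADV_WILLNEED);
			hinted_down_to = start;
		}

		if (pread(fd, buf, BLCKSZ, blkno * BLCKSZ) < 0)
		{
			perror("pread");
			return 1;
		}
	}

	close(fd);
	return 0;
}

Timing this against the same loop without the posix_fadvise call (dropping the page cache between runs) should show the same flavour of difference on rotating media as the truncation iostat numbers above, and next to none on an SSD.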