Greetings Stephen,

I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's a
huge ~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so the results are directly
comparable to the previous post. The only change is that I've applied your
patch on top of REL12_STABLE, as that is the cluster where I had the original
data. I think it deserves its own commitfest entry, right? Or do we combine
the two?

Each test was conducted with:
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 restart;
3. analyze verbose t;
4. measurements observed via pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s, baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

I think one needs to be sure to restart, so that the autovacuum workers get
the new effective_io_concurrency; maybe that's a documentation thing only.

I have just one question: with this patch, the strace I/O pattern of analyze
looks like below:

pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0

If you highlight "738754560" in the output, you can see that the same
fadvise64() is issued repeatedly until the block is finally pread() - in the
case of offset "738754560" it was advised three times before being read. Also,
I wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
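
I haven't read the patch code closely, so the below is only my guess at what
the strace suggests (hypothetical sketch, made-up names, not the real code):
the look-ahead window seems to be re-advised in full on every iteration, with
nothing remembering which blocks were already hinted:

/* Hypothetical sketch of what the strace suggests - not the patch code. */
#include <fcntl.h>
#include <unistd.h>

#define PREFETCH_DEPTH 8        /* e.g. effective_io_concurrency */
#define BLCKSZ 8192

static void
sample_blocks(int fd, const off_t *sorted_offsets, int nblocks)
{
    char    buf[BLCKSZ];
    int     i, j;

    for (i = 0; i < nblocks; i++)
    {
        /*
         * Advise the next PREFETCH_DEPTH blocks after the one we are about
         * to read.  The window slides by one block per iteration and nothing
         * tracks what was already advised, so every future block gets hinted
         * up to PREFETCH_DEPTH times before its pread() - which would explain
         * the duplicated fadvise64() calls above.
         */
        for (j = i + 1; j <= i + PREFETCH_DEPTH && j < nblocks; j++)
            (void) posix_fadvise(fd, sorted_offsets[j], BLCKSZ,
                                 POSIX_FADV_WILLNEED);

        (void) pread(fd, buf, BLCKSZ, sorted_offsets[i]);
    }
}

If that's what happens it looks harmless given how cheap the syscall
apparently is, but remembering the highest block advised so far would reduce
the fadvise64() traffic by roughly a factor of the prefetch depth.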

-J.

----------------------------------------------------------------------------------------------------------------

From: Stephen Frost
Sent: Tuesday, November 3, 2020 6:47 PM
To: Jakub Wartak
Cc: pgsql-hackers
Subject: Re: automatic analyze: readahead - add "IO read time" log message 

Greetings,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> >Interesting that you weren't seeing any benefit to disabling readahead.
> 
> I've got some free minutes and have repeated the exercise in a more realistic
> and stricter environment than the previous one, to conclude that the current
> situation is preferable:

Thanks for spending time on this!

> The analyzed table was 171GB (as reported by \dt+) + indexes: 35GB, 147GB,
> 35GB, 65GB (as reported by \di+)
> Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is a
> different layer and might have different storage settings), VG on top of
> dm-0, LV with stripe-size 8kB, ext4.
> s_b=128MB, RAM=128GB (minus ~30GB reserved for HugePages), typical
> output of PgSQL12:
> INFO:  "x": scanned 1500000 of 22395442 pages, containing 112410444 live rows 
> and 0 dead rows; 1500000 rows in sample, 1678321053 estimated total rows
> 
> Hot VFS cache:
> Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is 
> spent heapam_scan_analyze_next_block() -> .. -> pread() which causes 
> ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat 
> with average request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, 
> hot caches, total elapsed ~3m
> Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
> Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
> Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s
> 
> No VFS cache:
> Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
> read, ondemand_readahead visible in perf, total elapsed 3m30s
> Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total 
> elapsed 3m29s
> Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total 
> elapsed 3m28s
> 
> No VFS cache, readahead off:
> Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors 
> (*512=5.5kB), total elapsed 5m59s
> Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: as above, double-confirmed no readaheads [ 
> pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
> elapsed 5m56s
> Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: as above, total elapsed 5m55s

[ ... ]

> >The VACUUM case is going to be complicated by what's in the visibility
> >map. (..) 
> 
> After observing the readahead benefit for ANALYZE I've abandoned the idea of
> testing the much more advanced VACUUM processing; clearly Linux read-ahead is
> beneficial even in simple cases.

This seems to be indicating that while the Linux kernel may end up
reading pages we don't end up needing, it's much more often the case
that it's ending up reading *some* pages that we do need, and that's
happening often enough that it more than makes up for the extra reads
being done.

Instead of having these guessing games between the kernel and what PG's
doing, however, we could potentially do better using posix_fadvise() to
tell the kernel, up front, exactly what blocks we are going to ask for,
and perhaps that would end up improving things.
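
To sketch the shape of it (hypothetical, simplified names; not the attached
patch itself): run a second copy of the sampled block-number stream some
number of blocks ahead of the one actually being read, and call
PrefetchBuffer() on each block it produces - on Linux that ends up as the
posix_fadvise(POSIX_FADV_WILLNEED) calls you'd see in strace:

/* Hypothetical sketch, not the attached patch. */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/sampling.h"

static void
scan_analyze_blocks(Relation onerel, BlockSampler bs,
                    BlockSampler prefetch_bs, int prefetch_target)
{
    int         i;

    /* Prime the window: hint the first prefetch_target sampled blocks. */
    for (i = 0; i < prefetch_target && BlockSampler_HasMore(prefetch_bs); i++)
        PrefetchBuffer(onerel, MAIN_FORKNUM, BlockSampler_Next(prefetch_bs));

    while (BlockSampler_HasMore(bs))
    {
        BlockNumber targblock = BlockSampler_Next(bs);

        /* Keep the window full: hint one more block per block we read. */
        if (prefetch_target > 0 && BlockSampler_HasMore(prefetch_bs))
            PrefetchBuffer(onerel, MAIN_FORKNUM, BlockSampler_Next(prefetch_bs));

        /* ... ReadBuffer()/heapam_scan_analyze_next_block() on targblock ... */
        (void) targblock;
    }
}

The two samplers would have to be seeded identically so the prefetch stream
visits exactly the same blocks, in the same order, as the main one.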

Attached is a very much rough-n-ready patch for doing that, using
effective_io_concurrency to control how many blocks to pre-fetch for
ANALYZE (0 meaning 'none').  If you've got a chance to test with
different settings for effective_io_concurrency with the patch applied
to see what impact posix_fadvise() has on these ANALYZE runs, that would
be very cool to see.

Going between effective_io_concurrency = 0 and effective_io_concurrency = 10
with this patch, in some very quick testing on a laptop NVMe, while
making sure to drop caches and restart PG in between to clear out
shared_buffers, definitely shows that prefetching done this way is an
improvement over letting the kernel's normal read ahead handle it.

> >> My only idea would be that a lot of those blocks could be read
> >> asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on the
> >> block range beforehand, so maybe the optimization is possible, but not
> >> until we have AIO ;)
> >
> > (..)AIO is a whole other animal that's been discussed off and on
> >around here but it's a much larger and more invasive change than just
> >calling posix_fadvise().
> 
> Yes, I'm aware and I'm keeping my fingers crossed that maybe some day.... 

I don't think we should throw out the idea of using PrefetchBuffer()
here.  "Real" AIO would certainly be good to have one of these days, but
until then, posix_fadvise() could net us some of those gains in the
meantime.
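
For reference, on a build with USE_PREFETCH, PrefetchBuffer() for a block that
isn't already in shared_buffers goes through smgrprefetch() -> mdprefetch() ->
FilePrefetch(), which (paraphrasing fd.c from memory, not an exact quote) is
essentially:

/* Paraphrase of FilePrefetch() in src/backend/storage/file/fd.c. */
int
FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info)
{
#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_WILLNEED)
    int         returnCode;

    returnCode = FileAccess(file);  /* make sure the underlying fd is open */
    if (returnCode < 0)
        return returnCode;

    /* the same syscall that shows up in the strace output upthread */
    return posix_fadvise(VfdCache[file].fd, offset, amount,
                         POSIX_FADV_WILLNEED);
#else
    return 0;
#endif
}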

> ANALYZE just seems a natural candidate to use it. The only easy chance of
> accelerating stats gathering - at least from my and the end user's point of
> view - is to have more parallel autoanalyze workers running to drive more I/O
> concurrency (e.g. by partitioning the data), in both readahead and
> non-readahead scenarios. Which is a pity, because such a process sits idle
> 70-80% of the time. The readahead might read 10x more unnecessary data, but
> pread() doesn't have to wait. <speculation>Once AIO is there, it could throw
> thousands of requests without readahead and probably achieve much better
> efficiency.</speculation>

Actual parallel ANALYZE could be interesting, of course, but that's also
quite a bit more complicated to make happen, and even in that case it
would seem like we'd probably want to use posix_fadvise() anyway since
it'd hopefully help.

> I hope the previous simple patch goes into master and helps other people 
> understand
> the picture more easily.

I agree that we should also be providing users with more information
about what's going on, but maybe we can do that and improve things by
leveraging posix_fadvise().

Thanks,

Stephen
