On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
> >Based on ideas from earlier discussions[1][2], here is an experimental
> >WIP patch to improve recovery speed by prefetching blocks. If you set
> >wal_prefetch_distance to a positive distance, measured in bytes, then
> >the recovery loop will look ahead in the WAL and call PrefetchBuffer()
> >for referenced blocks. This can speed things up with cold caches
> >(example: after a server reboot) and working sets that don't fit in
> >memory (example: large scale pgbench).
> >
>
> Thanks, I only did a very quick review so far, but the patch looks fine.
Thanks for looking!

> >Results vary, but in contrived larger-than-memory pgbench crash
> >recovery experiments on a Linux development system, I've seen recovery
> >running as much as 20x faster with full_page_writes=off and
> >wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
> >discussed in the other thread.
>
> OK, so how did you test that? I'll do some tests with a traditional
> streaming replication setup, multiple sessions on the primary (and maybe
> a weaker storage system on the replica). I suppose that's another setup
> that should benefit from this.

Using a 4GB RAM, 16-thread virtual machine running Linux debian10
4.19.0-6-amd64 with an ext4 filesystem on NVMe storage:

postgres -D pgdata \
  -c full_page_writes=off \
  -c checkpoint_timeout=60min \
  -c max_wal_size=10GB \
  -c synchronous_commit=off

# in another shell
pgbench -i -s300 postgres
psql postgres -c checkpoint
pgbench -T60 -Mprepared -c4 -j4 postgres
killall -9 postgres

# save the crashed pgdata dir for repeated experiments
mv pgdata pgdata-save

# repeat this with values like wal_prefetch_distance=-1, 1kB, 8kB, 64kB, ...
rm -fr pgdata
cp -r pgdata-save pgdata
postgres -D pgdata -c wal_prefetch_distance=-1

What I see on my desktop machine is around a 10x speed-up:

wal_prefetch_distance=-1   -> 62s (same number for unpatched)
wal_prefetch_distance=8kB  ->  6s
wal_prefetch_distance=64kB ->  5s

On another dev machine I managed to get a 20x speed-up, using a much
longer test. It's probably more interesting to try out some more
realistic workloads rather than this cache-destroying uniform random
stuff, though. It might be interesting to test on systems with high
random read latency but high concurrency; I can think of a bunch of
network storage environments where that's the case, but I haven't
looked into them, beyond some toy testing with (non-Linux) NFS over a
slow network (results were promising).

> >Earlier work, and how this patch compares:
> >
> >* Sean Chittenden wrote pg_prefaulter[1], an external process that
> >uses worker threads to pread() referenced pages some time before
> >recovery does, and demonstrated very good speed-up, triggering a lot
> >of discussion of this topic. My WIP patch differs mainly in that it's
> >integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
> >than synchronous I/O from worker threads/processes. Sean wouldn't
> >have liked my patch much because he was working on ZFS and that
> >doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
> >works pretty well, and I'll try to get that upstreamed.
> >
>
> How long would it take to get the POSIX_FADV_WILLNEED to ZFS systems, if
> everything goes fine? I'm not sure what's the usual life-cycle, but I
> assume it may take a couple years to get it on most production systems.

Assuming they like it enough to commit it (and initial informal
feedback on the general concept has been positive -- it's not messing
with their code at all, it's just boilerplate code to connect the
relevant Linux and FreeBSD VFS callbacks), it could indeed be quite a
while before it appears in conservative package repos, but I don't
know, it depends where you get your OpenZFS/ZoL module from.
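For anyone following along who hasn't looked at the patch: the advice
in question ultimately boils down to a posix_fadvise() call on the data
file, along the lines of the deliberately simplified sketch below (the
real code goes through PrefetchBuffer()/smgr/FilePrefetch(); the
function name here is just illustrative):

#include <sys/types.h>
#include <fcntl.h>

/*
 * Deliberately simplified sketch: ask the kernel to start reading a
 * range of the file asynchronously.  A conforming posix_fadvise()
 * returns 0 on success or an error number; it does not set errno.
 */
static int
prefetch_hint(int fd, off_t offset, off_t length)
{
#ifdef POSIX_FADV_WILLNEED
    return posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
#else
    (void) fd;
    (void) offset;
    (void) length;
    return 0;               /* no-op where the advice isn't available */
#endif
}

Whether that call actually starts any I/O is entirely up to the kernel
and the filesystem, which is what the next question is about.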
> What other common filesystems are missing support for this?

Using our build farm as a way to know which operating systems we care
about as a community, in no particular order:

* I don't know about exotic or network filesystems on Linux.
* AIX 7.2's manual says "Valid option, but this value does not perform
  any action" for every kind of advice except POSIX_FADV_NOWRITEBEHIND
  (huh, nonstandard advice).
* Solaris's posix_fadvise() was a dummy libc function, as of 10 years
  ago when they closed the source; who knows after that.
* FreeBSD's UFS and NFS support other advice through a default handler
  but unfortunately ignore WILLNEED (I have patches for those too, not
  good enough to send anywhere yet).
* OpenBSD has no such syscall.
* NetBSD has the syscall, and I can see that it's hooked up to
  readahead code, so that's probably the only unqualified yes in this
  list.
* Windows has no equivalent syscall; the closest thing might be to use
  ReadFileEx() to initiate an async read into a dummy buffer; maybe you
  can use a zero event so it doesn't even try to tell you when the I/O
  completes, if you don't care?
* macOS has no such syscall, but you could in theory do an aio_read()
  into a dummy buffer. On the other hand I don't think that interface
  is a general solution for POSIX systems, because on at least Linux
  and Solaris, aio_read() is emulated by libc with a whole bunch of
  threads and we are allergic to those things (and even if we weren't,
  we wouldn't want a whole threadpool in every PostgreSQL process, so
  you'd need to hand off to a worker process, and then why bother?).
* HPUX, I don't know.

We could test any of those with a simple test I wrote[1], but I'm not
likely to test any non-open-source OS myself due to lack of access.

Amazingly, HPUX's posix_fadvise() doesn't appear to conform to POSIX:
it sets errno and returns -1, while POSIX says that it should return an
error number. Checking our source tree, I see that in pg_flush_data(),
we also screwed that up and expect errno to be set, though we got it
right in FilePrefetch().

In any case, Linux must be at the very least 90% of PostgreSQL
installations. Incidentally, sync_file_range() without wait is a sort
of opposite of WILLNEED (it means something like
"POSIX_FADV_WILLSYNC"), and no one seems terribly upset that we really
only have that on Linux (the emulations are pretty poor AFAICS).

> Presumably we could do what Sean's extension does, i.e. use a couple of
> bgworkers, each doing simple pread() calls. Of course, that's
> unnecessarily complicated on systems that have FADV_WILLNEED.

That is a good idea, and I agree. I have a patch set that does exactly
that. It's nearly independent of the WAL prefetch work; it just changes
how PrefetchBuffer() is implemented, affecting bitmap index scans,
vacuum and any future user of PrefetchBuffer. If you apply these
patches too then WAL prefetch will use it (just set
max_background_readers = 4 or whatever):

https://github.com/postgres/postgres/compare/master...macdice:bgreader

That's simplified from an abandoned patch I had lying around because I
was experimenting with prefetching all the way into shared buffers this
way. The simplified version just does pread() into a dummy buffer, for
the side effect of warming the kernel's cache, pretty much like
pg_prefaulter. There are some tricky questions around whether it's
better to wait or not when the request queue is full; the way I have
that is far too naive, and that question is probably related to your
point about being cleverer about how many prefetch blocks you should
try to have in flight.
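To show what I mean by "just does pread() into a dummy buffer", the
heart of that fallback is no more than the sketch below (the function
name and fixed block size are illustrative only; the real patch feeds
block requests to the background readers through a shared queue):

#include <sys/types.h>
#include <unistd.h>

#define PREFETCH_BLCKSZ 8192    /* stand-in for BLCKSZ */

/*
 * Read a block purely for the side effect of warming the kernel's page
 * cache, and throw the data away.  This is the essence of the bgreader
 * fallback for systems without a usable POSIX_FADV_WILLNEED.
 */
static void
warm_kernel_cache(int fd, off_t offset)
{
    char        scratch[PREFETCH_BLCKSZ];

    (void) pread(fd, scratch, sizeof(scratch), offset);
}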
A future version of PrefetchBuffer() might lock the buffer then tell
the worker (or some kernel async I/O facility) to write the data into
the buffer. If I understand correctly, to make that work we need
Robert's IO lock/condition variable transplant[2], and Andres's scheme
for a suitable interlocking protocol, and no doubt some bulletproof
cleanup machinery. I'm not working on any of that myself right now
because I don't want to step on Andres's toes.

> >Here are some cases where I expect this patch to perform badly:
> >
> >* Your WAL has multiple intermixed sequential access streams (ie
> >sequential access to N different relations), so that sequential access
> >is not detected, and then all the WILLNEED advice prevents Linux's
> >automagic readahead from working well. Perhaps that could be
> >mitigated by having a system that can detect up to N concurrent
> >streams, where N is more than the current 1, or by flagging buffers in
> >the WAL as part of a sequential stream. I haven't looked into this.
> >
>
> Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
> one by one), and doing some sort of sorting? That should allow readahead
> to kick in.

Yeah, but I don't want to do too much work in the startup process, or
get too opinionated about how the underlying I/O stack works. I think
we'd need to do things like that in a direct I/O future, but we'd
probably offload it (?). I figured the best approach for early work in
this space would be to just get out of the way if we detect sequential
access.

> >* The data is always found in our buffer pool, so PrefetchBuffer() is
> >doing nothing useful and you might as well not be calling it or doing
> >the extra work that leads up to that. Perhaps that could be mitigated
> >with an adaptive approach: too many PrefetchBuffer() hits and we stop
> >trying to prefetch, too many XLogReadBufferForRedo() misses and we
> >start trying to prefetch. That might work nicely for systems that
> >start out with cold caches but eventually warm up. I haven't looked
> >into this.
> >
>
> I think the question is what's the cost of doing such unnecessary
> prefetch. Presumably it's fairly cheap, especially compared to the
> opposite case (not prefetching a block not in shared buffers). I wonder
> how expensive would the adaptive logic be on cases that never need a
> prefetch (i.e. datasets smaller than shared_buffers).

Hmm. It's basically a buffer map probe. I think the adaptive logic
would probably be some kind of periodically resetting counter scheme,
but you're probably right to suspect that it might not even be worth
bothering with, especially if a single XLogReader can be made to do the
readahead with no real extra cost. Perhaps we should work on making the
cost of all prefetching overheads as low as possible first, before
trying to figure out whether it's worth building a system for avoiding
it.
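To make "periodically resetting counter scheme" a bit more concrete,
I'm picturing something like the sketch below. It's purely
illustrative: none of these names, thresholds or the period exist in
any patch, and the other half of the idea (re-enabling when
XLogReadBufferForRedo() starts missing) isn't shown.

#include <stdbool.h>

/*
 * Illustrative sketch only: every so many prefetch probes, look at how
 * many of them found the block already in shared buffers and decide
 * whether issuing advice is still worth the buffer mapping lookups.
 */
typedef struct PrefetchAdaptiveState
{
    int         probes;         /* prefetch attempts in this period */
    int         already_cached; /* ... that found the block cached */
    bool        enabled;        /* keep issuing prefetch advice? */
} PrefetchAdaptiveState;

static void
maybe_toggle_prefetch(PrefetchAdaptiveState *state)
{
    if (state->probes < 1000)
        return;                 /* wait for a reasonable sample */

    /* Stop prefetching if 95% or more of recent probes were cache hits. */
    state->enabled = (state->already_cached * 100 / state->probes) < 95;

    /* Reset so the decision is revisited periodically. */
    state->probes = 0;
    state->already_cached = 0;
}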
> >* The prefetch distance is set too low so that pread() waits are not
> >avoided, or your storage subsystem can't actually perform enough
> >concurrent I/O to get ahead of the random access pattern you're
> >generating, so no distance would be far enough ahead. To help with
> >the former case, perhaps we could invent something smarter than a
> >user-supplied distance (something like "N cold block references
> >ahead", possibly using effective_io_concurrency, rather than "N bytes
> >ahead").
> >
>
> In general, I find it quite non-intuitive to configure prefetching by
> specifying WAL distance. I mean, how would you know what's a good value?
> If you know the storage hardware, you probably know the optimal queue
> depth i.e. you know you the number of requests to get best throughput.

FWIW, on pgbench tests on flash storage I've found that 1kB only helps
a bit, 8kB is great, and more than that doesn't get any better. Of
course, this is meaningless in general; a zipfian workload might need
to look a lot further ahead than a uniform one to find anything worth
prefetching, and that's exactly what you're complaining about, and I
agree.

> But how do you deduce the WAL distance from that? I don't know. Plus
> right after the checkpoint the WAL contains FPW, reducing the number of
> blocks in a given amount of WAL (compared to right before a checkpoint).
> So I expect users might pick unnecessarily high WAL distance. OTOH with
> FPW we don't quite need agressive prefetching, right?

Yeah, so you need to be touching blocks more than once between
checkpoints, if you want to see speed-up on a system with blocks <=
BLCKSZ and FPW on. If checkpoints are far enough apart you'll
eventually run out of FPWs and start replaying non-FPW stuff. Or you
could be on a filesystem with larger blocks than PostgreSQL.

> Could we instead specify the number of blocks to prefetch? We'd probably
> need to track additional details needed to determine number of blocks to
> prefetch (essentially LSN for all prefetch requests).

Yeah, I think you're right, we should probably try to make a little
queue to track LSNs and count prefetch requests in and out. I think
you'd also want PrefetchBuffer() to tell you if the block was already
in the buffer pool, so that you don't count blocks that it decided not
to prefetch. I guess PrefetchBuffer() needs to return an enum (I
already had it returning a bool for another purpose relating to an edge
case in crash recovery, when relations have been dropped by a later WAL
record). I will think about that.

> Another thing to consider might be skipping recently prefetched blocks.
> Consider you have a loop that does DML, where each statement creates a
> separate WAL record, but it can easily touch the same block over and
> over (say inserting to the same page). That means the prefetches are
> not really needed, but I'm not sure how expensive it really is.

There are two levels of defence against repeatedly prefetching the same
block: PrefetchBuffer() checks for blocks that are already in our
cache, and before that, PrefetchState remembers the last block so that
we can avoid fetching that block (or the following block).

[1] https://github.com/macdice/some-io-tests
[2] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com