On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
> Hello hackers,
>
> Based on ideas from earlier discussions[1][2], here is an experimental
> WIP patch to improve recovery speed by prefetching blocks. If you set
> wal_prefetch_distance to a positive distance, measured in bytes, then
> the recovery loop will look ahead in the WAL and call PrefetchBuffer()
> for referenced blocks. This can speed things up with cold caches
> (example: after a server reboot) and working sets that don't fit in
> memory (example: large scale pgbench).
Thanks, I only did a very quick review so far, but the patch looks fine.
> Results vary, but in contrived larger-than-memory pgbench crash
> recovery experiments on a Linux development system, I've seen recovery
> running as much as 20x faster with full_page_writes=off and
> wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
> discussed in the other thread.
OK, so how did you test that? I'll do some tests with a traditional
streaming replication setup, multiple sessions on the primary (and maybe
a weaker storage system on the replica). I suppose that's another setup
that should benefit from this.
...
> Earlier work, and how this patch compares:
>
> * Sean Chittenden wrote pg_prefaulter[1], an external process that
> uses worker threads to pread() referenced pages some time before
> recovery does, and demonstrated very good speed-up, triggering a lot
> of discussion of this topic. My WIP patch differs mainly in that it's
> integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
> than synchronous I/O from worker threads/processes. Sean wouldn't
> have liked my patch much because he was working on ZFS and that
> doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
> works pretty well, and I'll try to get that upstreamed.
How long would it take for POSIX_FADV_WILLNEED support to reach ZFS
systems, if everything goes fine? I'm not sure what the usual
life-cycle is, but I assume it may take a couple of years to reach most
production systems.
What other common filesystems are missing support for this?
Presumably we could do what Sean's extension does, i.e. use a couple of
bgworkers, each doing simple pread() calls. Of course, that's
unnecessarily complicated on systems that have FADV_WILLNEED.
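Just to make the fallback concrete, it could be as simple as the sketch below. (prefetch_block() is a hypothetical helper, not anything from the patch or from pg_prefaulter; BLCKSZ is hard-coded here for illustration.)

```c
#define _XOPEN_SOURCE 600

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hypothetical helper: hint the kernel about an upcoming 8kB block
 * read; fall back to a synchronous pread() (what a bgworker would do)
 * on systems without POSIX_FADV_WILLNEED.  Returns 0 on success. */
static int
prefetch_block(int fd, off_t blockno)
{
#ifdef POSIX_FADV_WILLNEED
    return posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
#else
    char buf[BLCKSZ];

    /* No advice available: just read the block to warm the cache. */
    return pread(fd, buf, BLCKSZ, blockno * BLCKSZ) == BLCKSZ ? 0 : -1;
#endif
}
```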
...
> Here are some cases where I expect this patch to perform badly:
>
> * Your WAL has multiple intermixed sequential access streams (ie
> sequential access to N different relations), so that sequential access
> is not detected, and then all the WILLNEED advice prevents Linux's
> automagic readahead from working well. Perhaps that could be
> mitigated by having a system that can detect up to N concurrent
> streams, where N is more than the current 1, or by flagging buffers in
> the WAL as part of a sequential stream. I haven't looked into this.
Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
one by one) and do some sort of sorting? That should allow readahead to
kick in.
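Something like the toy batching/sorting step below is what I have in mind. (BlockRef and the function names are made up for illustration; the patch would be dealing with RelFileNode etc.)

```c
#include <assert.h>
#include <stdlib.h>

/* A block reference pulled out of the WAL: relation + block number. */
typedef struct BlockRef
{
    unsigned rel;
    unsigned block;
} BlockRef;

static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = a;
    const BlockRef *y = b;

    if (x->rel != y->rel)
        return x->rel < y->rel ? -1 : 1;
    if (x->block != y->block)
        return x->block < y->block ? -1 : 1;
    return 0;
}

/* Sort a batch of pending prefetch requests so that per-relation block
 * numbers come out ascending, letting kernel readahead detect the
 * sequential streams that interleaved WAL records would hide. */
static void
sort_prefetch_batch(BlockRef *refs, size_t n)
{
    qsort(refs, n, sizeof(BlockRef), blockref_cmp);
}
```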
> * The data is always found in our buffer pool, so PrefetchBuffer() is
> doing nothing useful and you might as well not be calling it or doing
> the extra work that leads up to that. Perhaps that could be mitigated
> with an adaptive approach: too many PrefetchBuffer() hits and we stop
> trying to prefetch, too many XLogReadBufferForRedo() misses and we
> start trying to prefetch. That might work nicely for systems that
> start out with cold caches but eventually warm up. I haven't looked
> into this.
I think the question is what the cost of such an unnecessary prefetch
is. Presumably it's fairly cheap, especially compared to the opposite
case (not prefetching a block that's not in shared buffers). I wonder
how expensive the adaptive logic would be in cases that never need a
prefetch (i.e. datasets smaller than shared_buffers).
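The adaptive logic itself could probably be as cheap as a couple of counters, something like this toy sketch (struct, function names and thresholds entirely made up):

```c
#include <assert.h>

/* Toy adaptive switch: stop issuing prefetches when they keep hitting
 * shared buffers, resume when redo keeps missing.  The thresholds here
 * are invented purely for illustration. */
typedef struct PrefetchState
{
    int consecutive_hits;    /* PrefetchBuffer() found page cached */
    int consecutive_misses;  /* XLogReadBufferForRedo() had to wait */
    int enabled;
} PrefetchState;

static void
note_prefetch_hit(PrefetchState *s)
{
    s->consecutive_misses = 0;
    if (++s->consecutive_hits >= 100)
        s->enabled = 0;          /* cache looks warm, stop prefetching */
}

static void
note_redo_miss(PrefetchState *s)
{
    s->consecutive_hits = 0;
    if (++s->consecutive_misses >= 10)
        s->enabled = 1;          /* cold blocks showing up, start again */
}
```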
> * The data is actually always in the kernel's cache, so the advice is
> a waste of a syscall. That might imply that you should probably be
> running with a larger shared_buffers (?). It's technically possible
> to ask the operating system if a region is cached on many systems,
> which could in theory be used for some kind of adaptive heuristic that
> would disable pointless prefetching, but I'm not proposing that.
> Ultimately this problem would be avoided by moving to true async I/O,
> where we'd be initiating the read all the way into our buffers (ie it
> replaces the later pread() so it's a wash, at worst).
Makes sense.
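For the record, probing the page cache is indeed possible on Linux via mmap() + mincore(). A sketch (just a probe, not something I'm proposing either):

```c
#define _DEFAULT_SOURCE

#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Ask the kernel whether the first page of fd's file is resident in
 * the page cache, via mmap() + mincore().  Returns 1 if cached, 0 if
 * not, -1 on error.  Linux-specific illustration only. */
static int
block_is_cached(int fd)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    unsigned char vec;
    int rc;
    void *p = mmap(NULL, pagesize, PROT_READ, MAP_SHARED, fd, 0);

    if (p == MAP_FAILED)
        return -1;
    rc = mincore(p, pagesize, &vec);
    munmap(p, pagesize);
    if (rc != 0)
        return -1;
    return vec & 1;              /* low bit set => page is resident */
}
```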
> * The prefetch distance is set too low so that pread() waits are not
> avoided, or your storage subsystem can't actually perform enough
> concurrent I/O to get ahead of the random access pattern you're
> generating, so no distance would be far enough ahead. To help with
> the former case, perhaps we could invent something smarter than a
> user-supplied distance (something like "N cold block references
> ahead", possibly using effective_io_concurrency, rather than "N bytes
> ahead").
In general, I find it quite non-intuitive to configure prefetching by
specifying WAL distance. I mean, how would you know what a good value
is? If you know the storage hardware, you probably know the optimal
queue depth, i.e. the number of requests needed to get the best
throughput. But how do you deduce the WAL distance from that? I don't
know. Plus right after a checkpoint the WAL contains FPWs, reducing the
number of block references in a given amount of WAL (compared to right
before a checkpoint). So I expect users might pick an unnecessarily
high WAL distance. OTOH with FPWs we don't quite need aggressive
prefetching, right?
Could we instead specify the number of blocks to prefetch? We'd
probably need to track additional details to determine how many blocks
are currently prefetched but not yet replayed (essentially the LSN of
each prefetch request).
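Roughly what I mean: a small ring of the LSNs at which prefetches were issued, retired as replay advances, so the distance could be capped in blocks in flight rather than WAL bytes. (Sizes and names below are purely illustrative.)

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define MAX_INFLIGHT 64

/* Toy ring buffer of LSNs at which prefetches were issued.  One slot
 * is kept free, so at most MAX_INFLIGHT - 1 entries fit. */
typedef struct PrefetchQueue
{
    XLogRecPtr lsns[MAX_INFLIGHT];
    int head;                    /* next free slot */
    int tail;                    /* oldest outstanding prefetch */
} PrefetchQueue;

/* Number of prefetched-but-not-yet-replayed blocks; this is the value
 * an effective_io_concurrency-style cap would compare against. */
static int
inflight_count(const PrefetchQueue *q)
{
    return (q->head - q->tail + MAX_INFLIGHT) % MAX_INFLIGHT;
}

static void
record_prefetch(PrefetchQueue *q, XLogRecPtr lsn)
{
    q->lsns[q->head] = lsn;
    q->head = (q->head + 1) % MAX_INFLIGHT;
}

/* Called as replay advances: retire prefetches whose referencing WAL
 * record has now been replayed. */
static void
replay_advanced_to(PrefetchQueue *q, XLogRecPtr replayed)
{
    while (q->tail != q->head && q->lsns[q->tail] <= replayed)
        q->tail = (q->tail + 1) % MAX_INFLIGHT;
}
```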
Another thing to consider might be skipping recently prefetched blocks.
Suppose you have a loop that does DML, where each statement creates a
separate WAL record but can easily touch the same block over and over
(say, inserting into the same page). In that case the repeated
prefetches are not really needed, though I'm not sure how expensive
they really are.
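A tiny fixed-size "recently prefetched" filter might be enough to skip those duplicates, e.g. (purely illustrative; the struct and names are made up):

```c
#include <assert.h>

#define RECENT_SLOTS 8

/* Remember the last few (rel, block) pairs so duplicate prefetch
 * requests can be skipped.  Note the zero-initialized slots would
 * wrongly match rel 0 / block 0; a real version would need a
 * proper "empty" marker. */
typedef struct RecentBlocks
{
    unsigned rel[RECENT_SLOTS];
    unsigned block[RECENT_SLOTS];
    int next;                    /* slot to overwrite next (FIFO) */
} RecentBlocks;

/* Returns 1 if (rel, block) was seen recently (caller should skip the
 * prefetch), otherwise records it and returns 0. */
static int
recently_prefetched(RecentBlocks *r, unsigned rel, unsigned block)
{
    for (int i = 0; i < RECENT_SLOTS; i++)
        if (r->rel[i] == rel && r->block[i] == block)
            return 1;
    r->rel[r->next] = rel;
    r->block[r->next] = block;
    r->next = (r->next + 1) % RECENT_SLOTS;
    return 0;
}
```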
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services