On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
> On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
> >Based on ideas from earlier discussions[1][2], here is an experimental
> >WIP patch to improve recovery speed by prefetching blocks. If you set
> >wal_prefetch_distance to a positive distance, measured in bytes, then
> >the recovery loop will look ahead in the WAL and call PrefetchBuffer()
> >for referenced blocks. This can speed things up with cold caches
> >(example: after a server reboot) and working sets that don't fit in
> >memory (example: large scale pgbench).
> >
>
> Thanks, I only did a very quick review so far, but the patch looks fine.
Thanks for looking!

> >Results vary, but in contrived larger-than-memory pgbench crash
> >recovery experiments on a Linux development system, I've seen recovery
> >running as much as 20x faster with full_page_writes=off and
> >wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as
> >discussed in the other thread.
>
> OK, so how did you test that? I'll do some tests with a traditional
> streaming replication setup, multiple sessions on the primary (and maybe
> a weaker storage system on the replica). I suppose that's another setup
> that should benefit from this.

Using a 4GB RAM, 16-thread virtual machine running Linux debian10
4.19.0-6-amd64 with an ext4 filesystem on NVMe storage:

postgres -D pgdata \
  -c full_page_writes=off \
  -c checkpoint_timeout=60min \
  -c max_wal_size=10GB \
  -c synchronous_commit=off

# in another shell
pgbench -i -s300 postgres
psql postgres -c checkpoint
pgbench -T60 -Mprepared -c4 -j4 postgres
killall -9 postgres

# save the crashed pgdata dir for repeated experiments
mv pgdata pgdata-save

# repeat this with values like wal_prefetch_distance=-1, 1kB, 8kB, 64kB, ...
rm -fr pgdata
cp -r pgdata-save pgdata
postgres -D pgdata -c wal_prefetch_distance=-1

What I see on my desktop machine is around a 10x speed-up:

wal_prefetch_distance=-1   -> 62s (same number for unpatched)
wal_prefetch_distance=8kB  ->  6s
wal_prefetch_distance=64kB ->  5s

On another dev machine I managed to get a 20x speed-up, using a much
longer test. It's probably more interesting to try out some more
realistic workloads rather than this cache-destroying uniform random
stuff, though. It might be interesting to test on systems with high
random read latency but high concurrency; I can think of a bunch of
network storage environments where that's the case, but I haven't
looked into them, beyond some toy testing with (non-Linux) NFS over a
slow network (results were promising).

> >Earlier work, and how this patch compares:
> >
> >* Sean Chittenden wrote pg_prefaulter[1], an external process that
> >uses worker threads to pread() referenced pages some time before
> >recovery does, and demonstrated very good speed-up, triggering a lot
> >of discussion of this topic. My WIP patch differs mainly in that it's
> >integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather
> >than synchronous I/O from worker threads/processes. Sean wouldn't
> >have liked my patch much because he was working on ZFS and that
> >doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it
> >works pretty well, and I'll try to get that upstreamed.
> >
>
> How long would it take to get the POSIX_FADV_WILLNEED to ZFS systems, if
> everything goes fine? I'm not sure what's the usual life-cycle, but I
> assume it may take a couple years to get it on most production systems.

Assuming they like it enough to commit it (and initial informal
feedback on the general concept has been positive -- it's not messing
with their code at all, it's just boilerplate code to connect the
relevant Linux and FreeBSD VFS callbacks), it could indeed be quite a
while before it appears in conservative package repos, but I don't
know, it depends where you get your OpenZFS/ZoL module from.
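For anyone following along who hasn't looked at the patch: the advice
in question ultimately boils down to a posix_fadvise() call on the data
file, along the lines of the deliberately simplified sketch below (the
real code goes through PrefetchBuffer()/smgr/FilePrefetch(); the
function name here is just illustrative):

#include <sys/types.h>
#include <fcntl.h>

/*
 * Deliberately simplified sketch: ask the kernel to start reading a
 * range of the file asynchronously.  A conforming posix_fadvise()
 * returns 0 on success or an error number; it does not set errno.
 */
static int
prefetch_hint(int fd, off_t offset, off_t length)
{
#ifdef POSIX_FADV_WILLNEED
    return posix_fadvise(fd, offset, length, POSIX_FADV_WILLNEED);
#else
    (void) fd;
    (void) offset;
    (void) length;
    return 0;               /* no-op where the advice isn't available */
#endif
}

Whether that call actually starts any I/O is entirely up to the kernel
and the filesystem, which is what the next question is about.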
> What other common filesystems are missing support for this?

Using our build farm as a way to know which operating systems we care
about as a community, in no particular order:

* I don't know about exotic or network filesystems on Linux.
* AIX 7.2's manual says "Valid option, but this value does not perform
  any action" for every kind of advice except POSIX_FADV_NOWRITEBEHIND
  (huh, nonstandard advice).
* Solaris's posix_fadvise() was a dummy libc function, as of 10 years
  ago when they closed the source; who knows after that.
* FreeBSD's UFS and NFS support other advice through a default handler
  but unfortunately ignore WILLNEED (I have patches for those too, not
  good enough to send anywhere yet).
* OpenBSD has no such syscall.
* NetBSD has the syscall, and I can see that it's hooked up to
  readahead code, so that's probably the only unqualified yes in this
  list.
* Windows has no equivalent syscall; the closest thing might be to use
  ReadFileEx() to initiate an async read into a dummy buffer; maybe you
  can use a zero event so it doesn't even try to tell you when the I/O
  completes, if you don't care?
* macOS has no such syscall, but you could in theory do an aio_read()
  into a dummy buffer. On the other hand I don't think that interface
  is a general solution for POSIX systems, because on at least Linux
  and Solaris, aio_read() is emulated by libc with a whole bunch of
  threads and we are allergic to those things (and even if we weren't,
  we wouldn't want a whole threadpool in every PostgreSQL process, so
  you'd need to hand off to a worker process, and then why bother?).
* HPUX, I don't know.

We could test any of those with a simple test I wrote[1], but I'm not
likely to test any non-open-source OS myself due to lack of access.

Amazingly, HPUX's posix_fadvise() doesn't appear to conform to POSIX:
it sets errno and returns -1, while POSIX says that it should return an
error number. Checking our source tree, I see that in pg_flush_data(),
we also screwed that up and expect errno to be set, though we got it
right in FilePrefetch().

In any case, Linux must be at the very least 90% of PostgreSQL
installations. Incidentally, sync_file_range() without wait is a sort
of opposite of WILLNEED (it means something like
"POSIX_FADV_WILLSYNC"), and no one seems terribly upset that we really
only have that on Linux (the emulations are pretty poor AFAICS).

> Presumably we could do what Sean's extension does, i.e. use a couple of
> bgworkers, each doing simple pread() calls. Of course, that's
> unnecessarily complicated on systems that have FADV_WILLNEED.

That is a good idea, and I agree. I have a patch set that does exactly
that. It's nearly independent of the WAL prefetch work; it just changes
how PrefetchBuffer() is implemented, affecting bitmap index scans,
vacuum and any future user of PrefetchBuffer. If you apply these
patches too then WAL prefetch will use it (just set
max_background_readers = 4 or whatever):

https://github.com/postgres/postgres/compare/master...macdice:bgreader

That's simplified from an abandoned patch I had lying around because I
was experimenting with prefetching all the way into shared buffers this
way. The simplified version just does pread() into a dummy buffer, for
the side effect of warming the kernel's cache, pretty much like
pg_prefaulter. There are some tricky questions around whether it's
better to wait or not when the request queue is full; the way I have
that is far too naive, and that question is probably related to your
point about being cleverer about how many prefetch blocks you should
try to have in flight.
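To show what I mean by "just does pread() into a dummy buffer", the
heart of that fallback is no more than the sketch below (the function
name and fixed block size are illustrative only; the real patch feeds
block requests to the background readers through a shared queue):

#include <sys/types.h>
#include <unistd.h>

#define PREFETCH_BLCKSZ 8192    /* stand-in for BLCKSZ */

/*
 * Read a block purely for the side effect of warming the kernel's page
 * cache, and throw the data away.  This is the essence of the bgreader
 * fallback for systems without a usable POSIX_FADV_WILLNEED.
 */
static void
warm_kernel_cache(int fd, off_t offset)
{
    char        scratch[PREFETCH_BLCKSZ];

    (void) pread(fd, scratch, sizeof(scratch), offset);
}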
A future version of PrefetchBuffer() might lock the buffer then tell
the worker (or some kernel async I/O facility) to write the data into
the buffer. If I understand correctly, to make that work we need
Robert's IO lock/condition variable transplant[2], and Andres's scheme
for a suitable interlocking protocol, and no doubt some bulletproof
cleanup machinery. I'm not working on any of that myself right now
because I don't want to step on Andres's toes.

> >Here are some cases where I expect this patch to perform badly:
> >
> >* Your WAL has multiple intermixed sequential access streams (ie
> >sequential access to N different relations), so that sequential access
> >is not detected, and then all the WILLNEED advice prevents Linux's
> >automagic readahead from working well. Perhaps that could be
> >mitigated by having a system that can detect up to N concurrent
> >streams, where N is more than the current 1, or by flagging buffers in
> >the WAL as part of a sequential stream. I haven't looked into this.
> >
>
> Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not
> one by one), and doing some sort of sorting? That should allow readahead
> to kick in.

Yeah, but I don't want to do too much work in the startup process, or
get too opinionated about how the underlying I/O stack works. I think
we'd need to do things like that in a direct I/O future, but we'd
probably offload it (?). I figured the best approach for early work in
this space would be to just get out of the way if we detect sequential
access.

> >* The data is always found in our buffer pool, so PrefetchBuffer() is
> >doing nothing useful and you might as well not be calling it or doing
> >the extra work that leads up to that. Perhaps that could be mitigated
> >with an adaptive approach: too many PrefetchBuffer() hits and we stop
> >trying to prefetch, too many XLogReadBufferForRedo() misses and we
> >start trying to prefetch. That might work nicely for systems that
> >start out with cold caches but eventually warm up. I haven't looked
> >into this.
> >
>
> I think the question is what's the cost of doing such unnecessary
> prefetch. Presumably it's fairly cheap, especially compared to the
> opposite case (not prefetching a block not in shared buffers). I wonder
> how expensive would the adaptive logic be on cases that never need a
> prefetch (i.e. datasets smaller than shared_buffers).

Hmm. It's basically a buffer map probe. I think the adaptive logic
would probably be some kind of periodically resetting counter scheme,
but you're probably right to suspect that it might not even be worth
bothering with, especially if a single XLogReader can be made to do the
readahead with no real extra cost. Perhaps we should work on making the
cost of all prefetching overheads as low as possible first, before
trying to figure out whether it's worth building a system for avoiding
it.
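To make "periodically resetting counter scheme" a bit more concrete,
I'm picturing something like the sketch below. It's purely
illustrative: none of these names, thresholds or the period exist in
any patch, and the other half of the idea (re-enabling when
XLogReadBufferForRedo() starts missing) isn't shown.

#include <stdbool.h>

/*
 * Illustrative sketch only: every so many prefetch probes, look at how
 * many of them found the block already in shared buffers and decide
 * whether issuing advice is still worth the buffer mapping lookups.
 */
typedef struct PrefetchAdaptiveState
{
    int         probes;         /* prefetch attempts in this period */
    int         already_cached; /* ... that found the block cached */
    bool        enabled;        /* keep issuing prefetch advice? */
} PrefetchAdaptiveState;

static void
maybe_toggle_prefetch(PrefetchAdaptiveState *state)
{
    if (state->probes < 1000)
        return;                 /* wait for a reasonable sample */

    /* Stop prefetching if 95% or more of recent probes were cache hits. */
    state->enabled = (state->already_cached * 100 / state->probes) < 95;

    /* Reset so the decision is revisited periodically. */
    state->probes = 0;
    state->already_cached = 0;
}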
> >* The prefetch distance is set too low so that pread() waits are not
> >avoided, or your storage subsystem can't actually perform enough
> >concurrent I/O to get ahead of the random access pattern you're
> >generating, so no distance would be far enough ahead. To help with
> >the former case, perhaps we could invent something smarter than a
> >user-supplied distance (something like "N cold block references
> >ahead", possibly using effective_io_concurrency, rather than "N bytes
> >ahead").
> >
>
> In general, I find it quite non-intuitive to configure prefetching by
> specifying WAL distance. I mean, how would you know what's a good value?
> If you know the storage hardware, you probably know the optimal queue
> depth i.e. you know you the number of requests to get best throughput.

FWIW, on pgbench tests on flash storage I've found that 1kB only helps
a bit, 8kB is great, and more than that doesn't get any better. Of
course, this is meaningless in general; a zipfian workload might need
to look a lot further ahead than a uniform one to find anything worth
prefetching, and that's exactly what you're complaining about, and I
agree.

> But how do you deduce the WAL distance from that? I don't know. Plus
> right after the checkpoint the WAL contains FPW, reducing the number of
> blocks in a given amount of WAL (compared to right before a checkpoint).
> So I expect users might pick unnecessarily high WAL distance. OTOH with
> FPW we don't quite need agressive prefetching, right?

Yeah, so you need to be touching blocks more than once between
checkpoints, if you want to see speed-up on a system with blocks <=
BLCKSZ and FPW on. If checkpoints are far enough apart you'll
eventually run out of FPWs and start replaying non-FPW stuff. Or you
could be on a filesystem with larger blocks than PostgreSQL.

> Could we instead specify the number of blocks to prefetch? We'd probably
> need to track additional details needed to determine number of blocks to
> prefetch (essentially LSN for all prefetch requests).

Yeah, I think you're right, we should probably try to make a little
queue to track LSNs and count prefetch requests in and out. I think
you'd also want PrefetchBuffer() to tell you if the block was already
in the buffer pool, so that you don't count blocks that it decided not
to prefetch. I guess PrefetchBuffer() needs to return an enum (I
already had it returning a bool for another purpose relating to an edge
case in crash recovery, when relations have been dropped by a later WAL
record). I will think about that.

> Another thing to consider might be skipping recently prefetched blocks.
> Consider you have a loop that does DML, where each statement creates a
> separate WAL record, but it can easily touch the same block over and
> over (say inserting to the same page). That means the prefetches are
> not really needed, but I'm not sure how expensive it really is.

There are two levels of defence against repeatedly prefetching the same
block: PrefetchBuffer() checks for blocks that are already in our
cache, and before that, PrefetchState remembers the last block so that
we can avoid fetching that block (or the following block).

[1] https://github.com/macdice/some-io-tests
[2] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com