* Andres Freund (and...@anarazel.de) wrote:
> On 2017-01-19 20:45:57 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > On 2017-01-19 10:06:09 -0500, Stephen Frost wrote:
> > > > WAL replay does do more work, generally speaking (the WAL has to be
> > > > read, the checksum validated on it, and then the write has to go out,
> > > > while the checkpointer just writes the page out from memory), but it's
> > > > also dealing with less contention on the system (there aren't a bunch of
> > > > backends hammering the disks to pull data in with reads when you're
> > > > doing crash recovery...).
> > >
> > > There's a huge difference though: WAL replay is single threaded, whereas
> > > generating WAL is not.
> >
> > I'm aware- but *checkpointing* is still single-threaded, unless, as I
> > mentioned, you end up with backends pushing out their own changes to the
> > heap to make room for new pages to come in.
>
> Sure, but buffer checkpointing isn't necessarily that large a portion of
> the work done in one checkpoint cycle, in comparison to all the WAL
> being generated. Quite commonly a lot of the buffers will already have
> been flushed to disk by backend and/or bgwriter, and are clean by the
> time checkpointer gets to them. So I don't think checkpointer being
> single threaded necessarily means much WRT replay performance.

Yes, good point, we also have the bgwriter going through and helping.
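To make the "already clean" point concrete, here's a simplified sketch
of the two-pass checkpoint write loop (illustrative only, not the
actual BufferSync() code; the struct and write_buffer_out() are made-up
stand-ins):

#include <stdbool.h>

typedef struct BufState
{
    bool    dirty;              /* page differs from its on-disk copy */
    bool    checkpoint_needed;  /* was dirty at checkpoint start */
} BufState;

extern void write_buffer_out(BufState *buf);    /* stand-in for the real write */

static void
checkpoint_write_loop(BufState *buffers, int nbuffers)
{
    int     i;

    /* Pass 1: note everything that's dirty as of checkpoint start. */
    for (i = 0; i < nbuffers; i++)
        buffers[i].checkpoint_needed = buffers[i].dirty;

    /*
     * Pass 2: write those buffers out.  If someone else (a backend
     * evicting the page, or the bgwriter) already cleaned a buffer in
     * the meantime, there's nothing left for the checkpointer to do.
     */
    for (i = 0; i < nbuffers; i++)
    {
        if (buffers[i].checkpoint_needed && buffers[i].dirty)
            write_buffer_out(&buffers[i]);
    }
}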
> > > Especially if there's synchronous IO required
> > > (most commonly reading in data, because more data was modified in the
> > > current checkpoint than fit in shared buffers, so FPIs don't pre-fill
> > > buffers), you can be significantly slower than generating the WAL.
> >
> > That is an interesting point, if I'm following what you're saying
> > correctly- during the replay we can end up having more pages modified
> > than fit in shared buffers, which means that we have to read back in
> > pages that we pushed out in order to apply the non-FPI WAL changes to
> > them.
>
> Right. (And not just during replay obviously, also during the initial
> WAL generation).

Sure.

> > I wonder if we should have a way to configure the amount of memory
> > allowed to be used for WAL replay, independent of shared_buffers?
>
> I don't quite see how that'd work, especially with HS. We just use the
> normal shared buffers code etc, and there we can't just resize the
> amount of shared_buffers allocated after doing crash recovery.

It wouldn't work with HS (or, at least, I have no idea how it would).
I was specifically thinking about *just* during crash recovery there
(sorry that I didn't make that clear), and my thought was that we'd
just allocate the memory locally, not as shared memory, and then drop
the whole thing and allocate shared_buffers after crash recovery was
done.  Obviously, this is a lot of hand-waving, but that's what I was
thinking.

> > That said, I wonder if our eviction algorithm could be
> > improved/changed when performing WAL replay too to reduce the chances
> > that we'll have to read a page back in.
>
> I don't think that's that promising an angle of attack. Having a
> separate pre-fetching backend that parses the WAL and pre-reads
> everything necessary seems more promising.

I agree, that would be helpful, and it could help with HS too, which is
an important piece.
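For what it's worth, the kind of loop I'd imagine such a pre-fetching
process running is sketched below (hand-wavy: WalRecord and
next_record_ahead() are made-up stand-ins for a real WAL reader;
posix_fadvise() is the only real API used):

#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

typedef struct WalRecord
{
    int         rel_fd;     /* fd of the file holding the target block */
    uint32_t    blkno;      /* block the record will modify */
    bool        has_fpi;    /* record carries a full-page image */
} WalRecord;

/* Stand-in: decode the next WAL record at the read-ahead distance,
 * some way in front of where redo is currently applying changes. */
extern bool next_record_ahead(WalRecord *rec);

static void
prefetch_loop(void)
{
    WalRecord   rec;

    while (next_record_ahead(&rec))
    {
        /*
         * An FPI overwrites the whole page, so redo never reads the
         * old contents; anything else needs the existing page in
         * memory before the change can be applied, which is exactly
         * the synchronous read we want to hint the kernel about.
         */
        if (!rec.has_fpi)
            (void) posix_fadvise(rec.rel_fd,
                                 (off_t) rec.blkno * BLCKSZ,
                                 BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }
}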
Thanks!

Stephen