* Andres Freund (and...@anarazel.de) wrote:
> On 2017-01-19 20:45:57 -0500, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > On 2017-01-19 10:06:09 -0500, Stephen Frost wrote:
> > > > WAL replay does do more work, generally speaking (the WAL has to be
> > > > read, the checksum validated on it, and then the write has to go out,
> > > > while the checkpointer just writes the page out from memory), but it's
> > > > also dealing with less contention on the system (there aren't a bunch of
> > > > backends hammering the disks to pull data in with reads when you're
> > > > doing crash recovery...).
> > >
> > > There's a huge difference though: WAL replay is single threaded, whereas
> > > generating WAL is not.
> >
> > I'm aware- but *checkpointing* is still single-threaded, unless, as I
> > mentioned, you end up with backends pushing out their own changes to the
> > heap to make room for new pages to come in.
>
> Sure, but buffer checkpointing isn't necessarily that large a portion of
> the work done in one checkpoint cycle, in comparison to all the WAL
> being generated. Quite commonly a lot of the buffers will already have
> been flushed to disk by backend and/or bgwriter, and are clean by the
> time checkpointer gets to them. So I don't think checkpointer being
> single threaded necessarily means much WRT replay performance.

Yes, good point, we also have the bgwriter going through and helping.
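To make the "already clean" point concrete, here's a simplified sketch
of the two-pass checkpoint write loop (illustrative only, not the
actual BufferSync() code; the struct and write_buffer_out() are made-up
stand-ins):

#include <stdbool.h>

typedef struct BufState
{
    bool    dirty;              /* page differs from its on-disk copy */
    bool    checkpoint_needed;  /* was dirty at checkpoint start */
} BufState;

extern void write_buffer_out(BufState *buf);    /* stand-in for the real write */

static void
checkpoint_write_loop(BufState *buffers, int nbuffers)
{
    int     i;

    /* Pass 1: note everything that's dirty as of checkpoint start. */
    for (i = 0; i < nbuffers; i++)
        buffers[i].checkpoint_needed = buffers[i].dirty;

    /*
     * Pass 2: write those buffers out.  If someone else (a backend
     * evicting the page, or the bgwriter) already cleaned a buffer in
     * the meantime, there's nothing left for the checkpointer to do.
     */
    for (i = 0; i < nbuffers; i++)
    {
        if (buffers[i].checkpoint_needed && buffers[i].dirty)
            write_buffer_out(&buffers[i]);
    }
}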
> > > Especially if there's synchronous IO required
> > > (most commonly reading in data, because more data was modified in the
> > > current checkpoint than fit in shared buffers, so FPIs don't pre-fill
> > > buffers), you can be significantly slower than generating the WAL.
> >
> > That is an interesting point, if I'm following what you're saying
> > correctly- during the replay we can end up having more pages modified
> > than fit in shared buffers, which means that we have to read back in
> > pages that we pushed out in order to apply the non-FPI WAL changes to
> > them.
>
> Right. (And not just during replay obviously, also during the initial
> WAL generation).

Sure.

> > I wonder if we should have a way to configure the amount of memory
> > allowed to be used for WAL replay, independent of shared_buffers?
>
> I don't quite see how that'd work, especially with HS. We just use the
> normal shared buffers code etc, and there we can't just resize the
> amount of shared_buffers allocated after doing crash recovery.

It wouldn't work with HS (or, at least, I have no idea how it would).
I was specifically thinking about *just* during crash recovery there
(sorry that I didn't make that clear), and my thought was that we'd
just allocate the memory locally, not as shared memory, and then drop
the whole thing and allocate shared_buffers after crash recovery was
done.  Obviously, this is a lot of hand-waving, but that's what I was
thinking.

> > That said, I wonder if our eviction algorithm could be
> > improved/changed when performing WAL replay too to reduce the chances
> > that we'll have to read a page back in.
>
> I don't think that's that promising an angle of attack. Having a
> separate pre-fetching backend that parses the WAL and pre-reads
> everything necessary seems more promising.

I agree, that would be helpful, and it could help with HS too, which is
an important piece.
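For what it's worth, the kind of loop I'd imagine such a pre-fetching
process running is sketched below (hand-wavy: WalRecord and
next_record_ahead() are made-up stand-ins for a real WAL reader;
posix_fadvise() is the only real API used):

#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

typedef struct WalRecord
{
    int         rel_fd;     /* fd of the file holding the target block */
    uint32_t    blkno;      /* block the record will modify */
    bool        has_fpi;    /* record carries a full-page image */
} WalRecord;

/* Stand-in: decode the next WAL record at the read-ahead distance,
 * some way in front of where redo is currently applying changes. */
extern bool next_record_ahead(WalRecord *rec);

static void
prefetch_loop(void)
{
    WalRecord   rec;

    while (next_record_ahead(&rec))
    {
        /*
         * An FPI overwrites the whole page, so redo never reads the
         * old contents; anything else needs the existing page in
         * memory before the change can be applied, which is exactly
         * the synchronous read we want to hint the kernel about.
         */
        if (!rec.has_fpi)
            (void) posix_fadvise(rec.rel_fd,
                                 (off_t) rec.blkno * BLCKSZ,
                                 BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }
}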
Thanks!

Stephen