Hi Thomas,

I'm unable to apply this new set of patches on HEAD. Could you please share a rebased patch, or if you have a work branch, point me to it? I will refer to it for the changes.
--
With Regards,
Ashutosh Sharma.

On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <dan...@yesql.se> wrote:
> > Could you post an updated version of the patch which is for review?
>
> Sorry for taking so long to come back; I learned some new things that
> made me want to restructure this code a bit (see below). Here is an
> updated pair of patches that I'm currently testing.
>
> Old problems:
>
> 1. Last time around, an infinite loop was reported in pg_waldump. I
> believe Horiguchi-san has fixed that[1], but I'm no longer depending
> on that patch. I thought his patch set was a good idea, but it's
> complicated and there's enough going on here already... let's consider
> that independently.
>
> This version goes back to what I had earlier, though (I hope) it is
> better about how "nonblocking" states are communicated. In this
> version, XLogPageRead() has a way to give up partway through a record
> if it doesn't have enough data and there are queued-up records that
> could be replayed right now. In that case, we'll go back to the
> beginning of the record (and occasionally, back a WAL page) next time
> we try. That's the cost of not maintaining intra-record decoding
> state.
>
> 2. Last time around, we could try to allocate a crazy amount of
> memory when reading garbage past the end of the WAL. Fixed, by
> validating first, like in master.
>
> New work:
>
> Since last time, I went away and worked on a "real" AIO version of
> this feature. That's ongoing experimental work for a future proposal,
> but I have a working prototype and I aim to share that soon, when that
> branch is rebased to catch up with recent changes. In that version,
> the prefetcher starts actual reads into the buffer pool, and recovery
> receives already-pinned buffers attached to the stream of records it's
> replaying.
>
> That inspired a couple of refactoring changes to this non-AIO version,
> to minimise the difference and anticipate the future work better:
>
> 1. The logic for deciding which block to start prefetching next is
> moved into a new callback function in a sort of standard form (this is
> approximately how all/most prefetching code looks in the AIO project,
> ie sequential scans, bitmap heap scan, etc). A sketch of this decision
> logic appears at the end of this message.
>
> 2. The logic for controlling how many I/Os are running, and deciding
> when to call the above, is in a separate component. In this non-AIO
> version, it works using a simple ring buffer of LSNs to estimate the
> number of in-flight I/Os, just like before (see the sketch right after
> this list). This part would be thrown away and replaced with the AIO
> branch's centralised "streaming read" mechanism, which tracks I/O
> completions based on a stream of completion events from the kernel (or
> I/O worker processes).
>
> 3. In this version, the prefetcher still doesn't pin buffers, for
> simplicity. That work did force me to study places where WAL streams
> need prefetching "barriers", though, so in this patch you can see
> that it's now a little more careful than it probably needs to be.
> (It doesn't really matter much if you call posix_fadvise() on a
> non-existent file region, or the wrong file after OID wraparound and
> reuse, but it would matter if you actually read it into a buffer, and
> if an intervening record might be trying to drop something you have
> pinned.)
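To make the ring-buffer idea in change 2 concrete, here is a minimal sketch of that style of I/O-depth accounting, assuming a fixed-size queue of LSNs. All names here are illustrative stand-ins, not the actual patch's code, and the part that decides when to issue the next prefetch is left out.

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

    #define PREFETCH_QUEUE_SIZE 64  /* power of two, >= the target I/O depth */

    typedef struct PrefetchQueue
    {
        XLogRecPtr  lsns[PREFETCH_QUEUE_SIZE];
        int         head;           /* next slot to fill */
        int         tail;           /* oldest still-outstanding I/O */
    } PrefetchQueue;

    /*
     * Remember that we called posix_fadvise() for a block referenced by the
     * record at 'lsn'.  The caller stops issuing prefetches once the depth
     * reaches its target, so the queue cannot overflow.
     */
    static void
    prefetch_queue_push(PrefetchQueue *q, XLogRecPtr lsn)
    {
        q->lsns[q->head] = lsn;
        q->head = (q->head + 1) % PREFETCH_QUEUE_SIZE;
    }

    /*
     * Treat I/Os as completed once replay has consumed the record that
     * caused them: by then the page must have been read with pread().
     */
    static void
    prefetch_queue_retire(PrefetchQueue *q, XLogRecPtr replayed_upto)
    {
        while (q->tail != q->head && q->lsns[q->tail] <= replayed_upto)
            q->tail = (q->tail + 1) % PREFETCH_QUEUE_SIZE;
    }

    /* Estimated fadvise() calls not yet followed by a pread() ("io_depth"). */
    static int
    prefetch_queue_depth(const PrefetchQueue *q)
    {
        return (q->head - q->tail + PREFETCH_QUEUE_SIZE) % PREFETCH_QUEUE_SIZE;
    }

The prefetcher would push an LSN for each posix_fadvise() it issues, retire entries as replay advances, and read further ahead in the WAL only while the estimated depth stays below the configured limit (maintenance_io_concurrency in the real patch).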
> Some other changes:
>
> 1. I dropped the GUC recovery_prefetch_fpw. I think it was a
> possibly useful idea, but it's a niche concern and not worth worrying
> about for now.
>
> 2. I simplified the stats. Coming up with a good running-average
> system seemed like a problem for another day (the numbers before were
> hard to interpret). The new stats are super-simple counters and
> instantaneous values:
>
> postgres=# select * from pg_stat_prefetch_recovery ;
> -[ RECORD 1 ]--+------------------------------
> stats_reset    | 2021-11-10 09:02:08.590217+13
> prefetch       | 13605674  <- times we called posix_fadvise()
> hit            | 24185289  <- times we found pages already cached
> skip_init      | 217215    <- times we did nothing because init, not read
> skip_new       | 192347    <- times we skipped because relation too small
> skip_fpw       | 27429     <- times we skipped because fpw, not read
> wal_distance   | 10648     <- how far ahead in WAL bytes
> block_distance | 134       <- how far ahead in block references
> io_depth       | 50        <- fadvise() calls not yet followed by pread()
>
> I also removed the code to save and restore the stats via the stats
> collector, for now. I figured that persistent stats could be a later
> feature, perhaps after the shared-memory stats stuff?
>
> 3. I dropped the code that was caching an SMgrRelation pointer to
> avoid smgropen() calls that showed up in some profiles. That probably
> lacked invalidation that could be done with some more WAL analysis,
> but I decided to leave it out completely for now, for simplicity.
>
> 4. I dropped the verbose logging. I think it might make sense to
> integrate with the new "recovery progress" system, but I think that
> should be a separate discussion. If you want to see the counters
> after crash recovery finishes, you can look at the stats view.
>
> [1] https://commitfest.postgresql.org/34/2113/
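To connect refactoring change 1 with the counters above, here is a rough sketch of the kind of per-block-reference decision such a callback makes. All type and helper names here are hypothetical stand-ins, not the patch's actual API; the stubs exist only to make the sketch compile.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BlockRef
    {
        uint32_t    blockno;    /* block number within the relation fork */
        bool        will_init;  /* the record will (re)initialize the page */
        bool        has_fpw;    /* the record carries a full-page image */
    } BlockRef;

    typedef enum PrefetchDecision
    {
        PREFETCH_ISSUED,        /* counted as "prefetch" */
        PREFETCH_HIT,           /* counted as "hit" */
        PREFETCH_SKIP_INIT,     /* counted as "skip_init" */
        PREFETCH_SKIP_NEW,      /* counted as "skip_new" */
        PREFETCH_SKIP_FPW       /* counted as "skip_fpw" */
    } PrefetchDecision;

    /* Stubs standing in for smgr/buffer-manager lookups. */
    static uint32_t
    relation_size_in_blocks(const BlockRef *ref)
    {
        (void) ref;
        return 1024;            /* pretend every relation has 1024 blocks */
    }

    static bool
    buffer_already_cached(const BlockRef *ref)
    {
        (void) ref;
        return false;
    }

    static void
    start_prefetch(const BlockRef *ref)
    {
        (void) ref;             /* the real code ends up in posix_fadvise() */
    }

    /*
     * Decide what to do about one block reference in a decoded WAL record.
     * Pages that will be initialized or overwritten by a full-page image
     * never need to be read, and blocks beyond the current relation size
     * cannot be prefetched.
     */
    static PrefetchDecision
    decide_prefetch(const BlockRef *ref)
    {
        if (ref->will_init)
            return PREFETCH_SKIP_INIT;
        if (ref->has_fpw)
            return PREFETCH_SKIP_FPW;
        if (ref->blockno >= relation_size_in_blocks(ref))
            return PREFETCH_SKIP_NEW;
        if (buffer_already_cached(ref))
            return PREFETCH_HIT;
        start_prefetch(ref);
        return PREFETCH_ISSUED;
    }

Each outcome increments exactly one of the five counters shown in the view above, which is what makes those numbers straightforward to interpret.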