Hi Thomas,

I'm unable to apply this new set of patches on HEAD. Could you please share a rebased patch, or if you have a work branch, point me to it? I will refer to it for the changes.
--
With Regards,
Ashutosh Sharma.

On Tue, Nov 23, 2021 at 3:44 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <dan...@yesql.se> wrote:
> > Could you post an updated version of the patch which is for review?
>
> Sorry for taking so long to come back; I learned some new things that
> made me want to restructure this code a bit (see below). Here is an
> updated pair of patches that I'm currently testing.
>
> Old problems:
>
> 1. Last time around, an infinite loop was reported in pg_waldump. I
> believe Horiguchi-san has fixed that[1], but I'm no longer depending
> on that patch. I thought his patch set was a good idea, but it's
> complicated and there's enough going on here already... let's consider
> that independently.
>
> This version goes back to what I had earlier, though (I hope) it is
> better about how "nonblocking" states are communicated. In this
> version, XLogPageRead() has a way to give up partway through a record
> if it doesn't have enough data and there are queued-up records that
> could be replayed right now. In that case, we'll go back to the
> beginning of the record (and occasionally, back a WAL page) next time
> we try. That's the cost of not maintaining intra-record decoding
> state.
>
> 2. Last time around, we could try to allocate a crazy amount of
> memory when reading garbage past the end of the WAL. Fixed, by
> validating first, like in master.
>
> New work:
>
> Since last time, I went away and worked on a "real" AIO version of
> this feature. That's ongoing experimental work for a future proposal,
> but I have a working prototype and I aim to share that soon, when that
> branch is rebased to catch up with recent changes. In that version,
> the prefetcher starts actual reads into the buffer pool, and recovery
> receives already-pinned buffers attached to the stream of records it's
> replaying.
>
> That inspired a couple of refactoring changes to this non-AIO version,
> to minimise the difference and anticipate the future work better:
>
> 1. The logic for deciding which block to start prefetching next is
> moved into a new callback function in a sort of standard form (this is
> approximately how all/most prefetching code looks in the AIO project,
> ie sequential scans, bitmap heap scan, etc). A sketch of this decision
> logic appears at the end of this message.
>
> 2. The logic for controlling how many I/Os are running, and deciding
> when to call the above, is in a separate component. In this non-AIO
> version, it works using a simple ring buffer of LSNs to estimate the
> number of in-flight I/Os, just like before (see the sketch right after
> this list). This part would be thrown away and replaced with the AIO
> branch's centralised "streaming read" mechanism, which tracks I/O
> completions based on a stream of completion events from the kernel (or
> I/O worker processes).
>
> 3. In this version, the prefetcher still doesn't pin buffers, for
> simplicity. That work did force me to study places where WAL streams
> need prefetching "barriers", though, so in this patch you can see
> that it's now a little more careful than it probably needs to be.
> (It doesn't really matter much if you call posix_fadvise() on a
> non-existent file region, or the wrong file after OID wraparound and
> reuse, but it would matter if you actually read it into a buffer, and
> if an intervening record might be trying to drop something you have
> pinned.)
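To make the ring-buffer idea in change 2 concrete, here is a minimal sketch of that style of I/O-depth accounting, assuming a fixed-size queue of LSNs. All names here are illustrative stand-ins, not the actual patch's code, and the part that decides when to issue the next prefetch is left out.

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

    #define PREFETCH_QUEUE_SIZE 64  /* power of two, >= the target I/O depth */

    typedef struct PrefetchQueue
    {
        XLogRecPtr  lsns[PREFETCH_QUEUE_SIZE];
        int         head;           /* next slot to fill */
        int         tail;           /* oldest still-outstanding I/O */
    } PrefetchQueue;

    /*
     * Remember that we called posix_fadvise() for a block referenced by the
     * record at 'lsn'.  The caller stops issuing prefetches once the depth
     * reaches its target, so the queue cannot overflow.
     */
    static void
    prefetch_queue_push(PrefetchQueue *q, XLogRecPtr lsn)
    {
        q->lsns[q->head] = lsn;
        q->head = (q->head + 1) % PREFETCH_QUEUE_SIZE;
    }

    /*
     * Treat I/Os as completed once replay has consumed the record that
     * caused them: by then the page must have been read with pread().
     */
    static void
    prefetch_queue_retire(PrefetchQueue *q, XLogRecPtr replayed_upto)
    {
        while (q->tail != q->head && q->lsns[q->tail] <= replayed_upto)
            q->tail = (q->tail + 1) % PREFETCH_QUEUE_SIZE;
    }

    /* Estimated fadvise() calls not yet followed by a pread() ("io_depth"). */
    static int
    prefetch_queue_depth(const PrefetchQueue *q)
    {
        return (q->head - q->tail + PREFETCH_QUEUE_SIZE) % PREFETCH_QUEUE_SIZE;
    }

The prefetcher would push an LSN for each posix_fadvise() it issues, retire entries as replay advances, and read further ahead in the WAL only while the estimated depth stays below the configured limit (maintenance_io_concurrency in the real patch).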
> Some other changes:
>
> 1. I dropped the GUC recovery_prefetch_fpw. I think it was a
> possibly useful idea, but it's a niche concern and not worth worrying
> about for now.
>
> 2. I simplified the stats. Coming up with a good running-average
> system seemed like a problem for another day (the numbers before were
> hard to interpret). The new stats are super-simple counters and
> instantaneous values:
>
> postgres=# select * from pg_stat_prefetch_recovery ;
> -[ RECORD 1 ]--+------------------------------
> stats_reset    | 2021-11-10 09:02:08.590217+13
> prefetch       | 13605674  <- times we called posix_fadvise()
> hit            | 24185289  <- times we found pages already cached
> skip_init      | 217215    <- times we did nothing because init, not read
> skip_new       | 192347    <- times we skipped because relation too small
> skip_fpw       | 27429     <- times we skipped because fpw, not read
> wal_distance   | 10648     <- how far ahead in WAL bytes
> block_distance | 134       <- how far ahead in block references
> io_depth       | 50        <- fadvise() calls not yet followed by pread()
>
> I also removed the code to save and restore the stats via the stats
> collector, for now. I figured that persistent stats could be a later
> feature, perhaps after the shared-memory stats stuff?
>
> 3. I dropped the code that was caching an SMgrRelation pointer to
> avoid smgropen() calls that showed up in some profiles. That probably
> lacked invalidation that could be done with some more WAL analysis,
> but I decided to leave it out completely for now, for simplicity.
>
> 4. I dropped the verbose logging. I think it might make sense to
> integrate with the new "recovery progress" system, but I think that
> should be a separate discussion. If you want to see the counters
> after crash recovery finishes, you can look at the stats view.
>
> [1] https://commitfest.postgresql.org/34/2113/
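To connect refactoring change 1 with the counters above, here is a rough sketch of the kind of per-block-reference decision such a callback makes. All type and helper names here are hypothetical stand-ins, not the patch's actual API; the stubs exist only to make the sketch compile.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BlockRef
    {
        uint32_t    blockno;    /* block number within the relation fork */
        bool        will_init;  /* the record will (re)initialize the page */
        bool        has_fpw;    /* the record carries a full-page image */
    } BlockRef;

    typedef enum PrefetchDecision
    {
        PREFETCH_ISSUED,        /* counted as "prefetch" */
        PREFETCH_HIT,           /* counted as "hit" */
        PREFETCH_SKIP_INIT,     /* counted as "skip_init" */
        PREFETCH_SKIP_NEW,      /* counted as "skip_new" */
        PREFETCH_SKIP_FPW       /* counted as "skip_fpw" */
    } PrefetchDecision;

    /* Stubs standing in for smgr/buffer-manager lookups. */
    static uint32_t
    relation_size_in_blocks(const BlockRef *ref)
    {
        (void) ref;
        return 1024;            /* pretend every relation has 1024 blocks */
    }

    static bool
    buffer_already_cached(const BlockRef *ref)
    {
        (void) ref;
        return false;
    }

    static void
    start_prefetch(const BlockRef *ref)
    {
        (void) ref;             /* the real code ends up in posix_fadvise() */
    }

    /*
     * Decide what to do about one block reference in a decoded WAL record.
     * Pages that will be initialized or overwritten by a full-page image
     * never need to be read, and blocks beyond the current relation size
     * cannot be prefetched.
     */
    static PrefetchDecision
    decide_prefetch(const BlockRef *ref)
    {
        if (ref->will_init)
            return PREFETCH_SKIP_INIT;
        if (ref->has_fpw)
            return PREFETCH_SKIP_FPW;
        if (ref->blockno >= relation_size_in_blocks(ref))
            return PREFETCH_SKIP_NEW;
        if (buffer_already_cached(ref))
            return PREFETCH_HIT;
        start_prefetch(ref);
        return PREFETCH_ISSUED;
    }

Each outcome increments exactly one of the five counters shown in the view above, which is what makes those numbers straightforward to interpret.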