On 17.06.2018 03:00, Andres Freund wrote:
On 2018-06-16 23:25:34 +0300, Konstantin Knizhnik wrote:

On 16.06.2018 22:02, Andres Freund wrote:
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:
On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:
On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
The pg_wal_prefetch function will traverse the WAL indefinitely and prefetch
the blocks referenced in WAL records
using the posix_fadvise(WILLNEED) system call.
Hi Konstantin,

Why stop at the page cache...  what about shared buffers?

It is good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.

Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:

- it leads to guaranteed double buffering, in a way that's just about
    guaranteed to *never* be useful. Because we'd only prefetch whenever
    there's an upcoming write, there's simply no benefit in the page
    staying in the page cache - we'll write out the whole page back to the
    OS.
Sorry, I do not completely understand this.
Prefetch is only needed for a partial update of a page - in this case we need
to first read the page from disk
Yes.


before being able to perform the update. So before "we'll write out the whole
page back to the OS" we have to read this page.
And if the page is in the OS cache (prefetched), this can be done much faster.
Yes.


Please notice that at the moment of prefetch there is no double
buffering.
Sure, but as soon as it's read there is.


Since the page has not been accessed before, it is not present in shared buffers.
And once the page is updated, there is really no need to keep it in shared
buffers. We can use cyclic buffers (as in the case of a sequential scan or
bulk update) to prevent the redo process from throwing useful pages out of
shared buffers. So once again there will be no double buffering.
That's a terrible idea. There's a *lot* of spatial locality of further
WAL records arriving for the same blocks.

In some cases that is true, in others it is not. In a typical OLTP system, if a record is updated, there is a high probability that it will be accessed again soon. So if such a system performs write requests on the master and read-only queries at the replicas,
keeping the updated pages in shared buffers at a replica can be very helpful.

But if a replica mostly runs analytic queries while the master performs updates, then it is more useful to keep indexes and the most frequently accessed pages in the replica's cache, rather than recent updates from the master.

So at least it seems reasonable to have such a parameter and let the DBA choose the caching policy at the replicas.

I am not so familiar with the current implementation of the full-page-write
mechanism in Postgres, so maybe the idea explained below is naive or already
implemented (but I failed to find any traces of it).
Prefetch is needed only for WAL records performing a partial update; a full
page write doesn't require prefetch.
A full page write has to be performed when a page is updated for the first
time after a checkpoint.
But what if we slightly extend this rule and perform a full page write also
when the distance from the previous full page write exceeds some delta
(which is somehow related to the size of the OS cache)?

In this case, even if the checkpoint interval is larger than the OS cache
size, we can still expect that updated pages are present in the OS cache.
And no WAL prefetch is needed at all!
We could do so, but I suspect the WAL volume penalty would be
prohibitive in many cases. Worthwhile to try though.

Well, the typical size of a server's memory is now several hundred gigabytes. Certainly some of this memory is used for shared buffers, backends' work memory, ... But still there are hundreds of gigabytes of free memory which the OS can use for caching. Let's assume that the full-page-write threshold is 100GB. That is one extra 8kB page per 100GB of WAL! Certainly this is an estimate for one page only, and it is more realistic to expect that we will have to force full page writes for most of the updated pages. But I still do not believe that it will cause significant growth of the log size.

Another question is why we choose such a large checkpoint interval: more than a hundred gigabytes. Certainly frequent checkpoints have a negative impact on performance, but 100GB is not "too frequent" in any case...


