On 16.06.2018 22:02, Andres Freund wrote:
On 2018-06-16 11:38:59 +0200, Tomas Vondra wrote:

On 06/15/2018 08:01 PM, Andres Freund wrote:
On 2018-06-14 10:13:44 +0300, Konstantin Knizhnik wrote:

On 14.06.2018 09:52, Thomas Munro wrote:
On Thu, Jun 14, 2018 at 1:09 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
The pg_wal_prefetch function will infinitely traverse the WAL and prefetch block
references in WAL records
using the posix_fadvise(WILLNEED) system call.
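(For reference, the mechanism boils down to a call like the following - a
simplified, self-contained sketch rather than the actual patch code;
wal_prefetch_block() and the way the segment file descriptor is obtained are
invented here for illustration.)

#include <fcntl.h>

#define BLCKSZ 8192             /* usual PostgreSQL block size */

static void
wal_prefetch_block(int seg_fd, unsigned int blkno_in_segment)
{
    off_t   offset = (off_t) blkno_in_segment * BLCKSZ;

    /*
     * Ask the kernel to start reading this block into the OS page cache.
     * The call does not block; the hope is that by the time redo actually
     * reads the page, it is already cached.
     */
    (void) posix_fadvise(seg_fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
}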
Hi Konstantin,

Why stop at the page cache...  what about shared buffers?

It is a good question. I thought a lot about prefetching directly to shared
buffers.
I think that's definitely how this should work.  I'm pretty strongly
opposed to a prefetching implementation that doesn't read into s_b.

Could you elaborate why prefetching into s_b is so much better (I'm sure it
has advantages, but I suppose prefetching into page cache would be much
easier to implement).
I think there's a number of issues with just issuing prefetch requests
via fadvise etc:

- it leads to guaranteed double buffering, in a way that's just about
   guaranteed to *never* be useful. Because we'd only prefetch whenever
   there's an upcoming write, there's simply no benefit in the page
   staying in the page cache - we'll write out the whole page back to the
   OS.

Sorry, I do not completely understand this.
Prefetch is only needed for a partial update of a page - in this case we need to first read the page from disk before being able to perform the update. So before "we'll write out the whole page back to the OS" we have to read this page.
And if the page is in the OS cache (prefetched), then this can be done much faster.

Please notice that at the moment of prefetch there is no double buffering: since the page has not been accessed before, it is not present in shared buffers. And once the page is updated, there is really no need to keep it in shared buffers. We can use cyclic buffers (a ring, as in the case of a sequential scan or bulk update) to prevent the redo process from throwing useful pages out of shared buffers. So once again there will be no double buffering.
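To illustrate the ring idea: the existing buffer access strategies already
provide such a ring for bulk operations, and the usual pattern looks roughly
like this (a sketch of ordinary backend code using the existing API, not of
the redo process; rel and blkno are assumed to be available, and whether the
startup process could reuse this machinery as-is is exactly the open
question):

    /*
     * Read a page through a small ring of shared buffers, as sequential
     * scans and bulk writes do, so that this access does not evict the
     * rest of shared buffers.
     */
    BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
    Buffer      buf;

    buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, strategy);
    /* ... use or update the page ... */
    ReleaseBuffer(buf);

    FreeAccessStrategy(strategy);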
- reading from the page cache is far from free - so you add costs to the
   replay process that it doesn't need to do.
- you don't have any sort of completion notification, so you basically
   just have to guess how far ahead you want to read. If you read a bit
   too much you suddenly get into synchronous blocking land.
- The OS page cache is actually not particularly scalable to large amounts
   of data either. Nor are its decisions about what to keep cached likely to
   be particularly useful.
- We imo need to add support for direct IO before long, and adding more
   and more work to reach feature parity strikes me as a bad move.

I am not so familiar with the current implementation of the full page writes mechanism in Postgres, so the idea explained below may be stupid or already implemented (but I failed to find any traces of it). Prefetch is needed only for WAL records performing a partial update; a full page write doesn't require prefetch. A full page write has to be performed when the page is updated for the first time after a checkpoint. But what if we slightly extend this rule and also perform a full page write when the distance from the previous full page write exceeds some delta
(which is somehow related to the size of the OS cache)?

In this case, even if the checkpoint interval is larger than the OS cache size, we can still expect that updated pages are present in the OS cache.
And no WAL prefetch is needed at all!
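In pseudocode, the extended rule I have in mind would look roughly like this
(a sketch only, not existing PostgreSQL code: fpw_distance is a made-up
parameter, and the page LSN is used as a cheap stand-in for the position of
the last full page image):

static bool
need_full_page_write(XLogRecPtr page_lsn, XLogRecPtr redo_ptr,
                     XLogRecPtr insert_lsn, uint64 fpw_distance)
{
    if (page_lsn <= redo_ptr)
        return true;    /* existing rule: first modification after checkpoint */
    if (insert_lsn - page_lsn > fpw_distance)
        return true;    /* extended rule: previous image is too far behind */
    return false;
}

With such a rule, any page that has to be read for a partial replay was
necessarily modified within the last fpw_distance bytes of WAL, so it has a
good chance of still being in the OS cache.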

