Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Cedric Villemain Wed, 03 Jan 2024 15:24:06 -0800

Hi,

I wonder what you think of making pg_prewarm use recent addition onsmgrprefetch and readv ?

In order to try, I did it anyway in the attached patches. They containno doc update, but I will proceed if it is of interest.


In summary:

1. The first one adds a new check on parameters (checking last block isindeed not before first block).

Consequence is an ERROR is raised instead of silently doing nothing.

2. The second one does implement smgrprefetch with range and loops bydefault per segment to still have a check for interrupts.

3. The third one provides smgrreadv instead of smgrread, by default ona range of 8 buffers. I am absolutely unsure that I used readv correctly...

Q: posix_fadvise may not work exactly the way you think it does, or doesit ?



In details, and for the question:

It's not so obvious that the "feature" is really required or wanted,depending on what are the expectations from user point of view.

The kernel decides on what to do with posix_fadvise calls, and how wepass parameters does impact the decision.With the current situation where prefetch is done step by step, block byblock, they are very probably most of the time all loaded even if thosefrom the beginning of the relation can be discarded at the end of theprefetch.

However, if instead you provide a real range, or the magic len=0 toposix_fadvise, then blocks are "more" loaded according to effective vmpressure (which is not the case on the previous example).As a result only a small part of the relation might be loaded, and thisis probably not what end-users expect despite being probably a goodchoice (you can still free cache beforehand to help the kernel).

An example, below I'm using vm_relation_cachestat() which provides linuxcachestat output, and vm_relation_fadvise() to unload cache, andpg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache,not postgres blocks)

```

postgres=# select block_start, block_count, nr_pages, nr_cache fromvm_relation_cachestat('foo',range:=1024*32);

block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |        0
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
    132744
(1 row)
```

# Checking results:

```

postgres=# select block_start, block_count, nr_pages, nr_cache fromvm_relation_cachestat('foo',range:=1024*32);

block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |      320
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |      320  <-- segment 1

```

# Load block by block and check:

```

postgres=# select from generate_series(0, 132743) g(n), lateralpg_prewarm('foo','prefetch', 'main', n, n);postgres=# select block_start, block_count, nr_pages, nr_cache fromvm_relation_cachestat('foo',range:=1024*32);

block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |    65536
      32768 |       32768 |    65536 |    65536
      65536 |       32768 |    65536 |    65536
      98304 |       32768 |    65536 |    65536
     131072 |        1672 |     3344 |     3344

```

The duration of the last example is also really significant: fullrelation is 0.3ms and block by block is 1550ms!You might think it's because of generate_series or whatever, but I havethe exact same behavior with pgfincore.I can compare loading and unloading duration for similar "async" work,here each call is from block 0 with len of 132744 and a range of 1 block(i.e. posix_fadvise on 8kB at a time).So they have exactly the same number of operations doing DONTNEED orWILLNEED, but distinct duration on the first "load":

```

postgres=# select * fromvm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');

vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms

postgres=# select * fromvm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');

vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !

postgres=# select * fromvm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');

vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

Thank you for your time reading this longer than expected email.

Comments ?


---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

From eeb24acd782f1dc2afff80fccfcec442521b9622 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <[email protected]>
Date: Wed, 3 Jan 2024 19:03:22 +0100
Subject: [PATCH 1/3] Check last block greater or equal to first block in
 pg_prewarm

This prevents useless call and inform users of unexpected value as
parameter.
---
 contrib/pg_prewarm/pg_prewarm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 01fc2c8ad9..1de1b39482 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -133,10 +133,11 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	else
 	{
 		last_block = PG_GETARG_INT64(4);
-		if (last_block < 0 || last_block >= nblocks)
+		if (last_block < first_block || last_block >= nblocks)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("ending block number must be between 0 and %lld",
+					 errmsg("ending block number must be between %lld and %lld",
+							(long long) (first_block),
 							(long long) (nblocks - 1))));
 	}
 
-- 
2.39.2

From 8fd2a0611a13ca40fe7d4c28a27257ba0f7e2a7a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <[email protected]>
Date: Wed, 3 Jan 2024 19:08:37 +0100
Subject: [PATCH 2/3] Allow pg_prewarm to use new smgrprefetch range

This commit replaces BufferPrefetch with smgrprefetch (with range).
It reduces the number of instructions to prefetch data.

I also defined a PREWARM_PREFETCH_RANGE which is by default the size of
a segment so there is a chance to interrupt the prefetching (on very
large table it might be desirable).
---
 contrib/pg_prewarm/pg_prewarm.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 1de1b39482..aea07927db 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -25,6 +25,8 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
+#define PREWARM_PREFETCH_RANGE	RELSEG_SIZE
+
 PG_MODULE_MAGIC;
 
 PG_FUNCTION_INFO_V1(pg_prewarm);
@@ -156,11 +158,21 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 * no practical way to do that at present without a gross modularity
 		 * violation, so we just do this.
 		 */
-		for (block = first_block; block <= last_block; ++block)
+		for (block = first_block; block <= last_block;
+			 block += PREWARM_PREFETCH_RANGE)
 		{
+			int seek = Min(PREWARM_PREFETCH_RANGE, (last_block - block + 1));
+
+			/*
+			 * if handling a multi-TB relation, we need a way to interrupt the
+			 * prefetching: smgrprefetch (mdprefetch) will loop on all segments
+			 * without interruption so we use a range and keep the following
+			 * CHECK in place
+			 */
 			CHECK_FOR_INTERRUPTS();
-			PrefetchBuffer(rel, forkNumber, block);
-			++blocks_done;
+
+			smgrprefetch(RelationGetSmgr(rel), forkNumber, block, seek);
+			blocks_done += seek;
 		}
 #else
 		ereport(ERROR,
-- 
2.39.2

From cb2fa6f0245b28a1d3f83ac901eb0d8698f22dd3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <[email protected]>
Date: Wed, 3 Jan 2024 20:16:37 +0100
Subject: [PATCH 3/3] Allow pg_prewarm to use new smgrreadv

This commit replaces smgrread with smgrreadv (with range).

I also defined a PREWARM_READ_RANGE, 8 by default, which is the
number of buffers to read in one call.

I just discovered pg_readv so I hope the initialization I wrote is correct.

It's not really sure there is a huge win because all blocks are sequential
and readv should merge them but not use vectorization (if I understood
correctly).
---
 contrib/pg_prewarm/pg_prewarm.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index aea07927db..e11694d707 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -26,6 +26,7 @@
 #include "utils/rel.h"
 
 #define PREWARM_PREFETCH_RANGE	RELSEG_SIZE
+#define PREWARM_READ_RANGE	8
 
 PG_MODULE_MAGIC;
 
@@ -38,7 +39,7 @@ typedef enum
 	PREWARM_BUFFER,
 } PrewarmType;
 
-static PGIOAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffers[PREWARM_READ_RANGE];
 
 /*
  * pg_prewarm(regclass, mode text, fork text,
@@ -187,11 +188,19 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 * buffers.  This is more portable than prefetch mode (it works
 		 * everywhere) and is synchronous.
 		 */
-		for (block = first_block; block <= last_block; ++block)
+		char	*buffers[PREWARM_READ_RANGE];
+		for (int i=0; i < PREWARM_READ_RANGE; i++)
+			buffers[i] = blockbuffers[i].data;
+		for (block = first_block; block <= last_block;
+			 block += PREWARM_READ_RANGE)
 		{
+			int seek = Min(PREWARM_READ_RANGE, (last_block - block + 1));
+
 			CHECK_FOR_INTERRUPTS();
-			smgrread(RelationGetSmgr(rel), forkNumber, block, blockbuffer.data);
-			++blocks_done;
+
+			smgrreadv(RelationGetSmgr(rel), forkNumber, block, (void *) buffers,
+					  seek);
+			blocks_done += seek;
 		}
 	}
 	else if (ptype == PREWARM_BUFFER)
-- 
2.39.2

Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

Reply via email to