Hi,

I wonder what you think of making pg_prewarm use recent addition on smgrprefetch and readv ?


In order to try, I did it anyway in the attached patches. They contain no doc update, but I will proceed if it is of interest.

In summary:

1. The first one adds a new check on parameters (checking last block is indeed not before first block).
Consequence is an ERROR is raised instead of silently doing nothing.

2. The second one does implement smgrprefetch with range and loops by default per segment to still have a check for interrupts.

3. The third one provides smgrreadv instead of smgrread,  by default on a range of 8 buffers. I am absolutely unsure that I used readv correctly...

Q: posix_fadvise may not work exactly the way you think it does, or does it ?


In details, and for the question:

It's not so obvious that the "feature" is really required or wanted, depending on what are the expectations from user point of view.

The kernel decides on what to do with posix_fadvise calls, and how we pass parameters does impact the decision. With the current situation where prefetch is done step by step, block by block, they are very probably most of the time all loaded even if those from the beginning of the relation can be discarded at the end of the prefetch.

However,  if instead you provide a real range, or the magic len=0 to posix_fadvise, then blocks are "more" loaded according to effective vm pressure (which is not the case on the previous example). As a result only a small part of the relation might be loaded, and this is probably not what end-users expect despite being probably a good choice (you can still free cache beforehand to help the kernel).

An example, below I'm using vm_relation_cachestat() which provides linux cachestat output, and vm_relation_fadvise() to unload cache, and pg_prewarm for the demo:

# clear cache: (nr_cache is the number of file system pages in cache, not postgres blocks)

```
postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |        0
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
    132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |      320
      32768 |       32768 |    65536 |        0
      65536 |       32768 |    65536 |        0
      98304 |       32768 |    65536 |        0
     131072 |        1672 |     3344 |      320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral pg_prewarm('foo','prefetch', 'main', n, n); postgres=# select block_start, block_count, nr_pages, nr_cache from vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
          0 |       32768 |    65536 |    65536
      32768 |       32768 |    65536 |    65536
      65536 |       32768 |    65536 |    65536
      98304 |       32768 |    65536 |    65536
     131072 |        1672 |     3344 |     3344

```

The duration of the last example is also really significant: full relation is 0.3ms and block by block is 1550ms! You might think it's because of generate_series or whatever, but I have the exact same behavior with pgfincore. I can compare loading and unloading duration for similar "async" work, here each call is from block 0 with len of 132744 and a range of 1 block (i.e. posix_fadvise on 8kB at a time). So they have exactly the same number of operations doing DONTNEED or WILLNEED, but distinct duration on the first "load":

```

postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```

Thank you for your time reading this longer than expected email.

Comments ?


---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D
From eeb24acd782f1dc2afff80fccfcec442521b9622 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <cedric.villem...@data-bene.io>
Date: Wed, 3 Jan 2024 19:03:22 +0100
Subject: [PATCH 1/3] Check last block greater or equal to first block in
 pg_prewarm

This prevents useless call and inform users of unexpected value as
parameter.
---
 contrib/pg_prewarm/pg_prewarm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 01fc2c8ad9..1de1b39482 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -133,10 +133,11 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	else
 	{
 		last_block = PG_GETARG_INT64(4);
-		if (last_block < 0 || last_block >= nblocks)
+		if (last_block < first_block || last_block >= nblocks)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("ending block number must be between 0 and %lld",
+					 errmsg("ending block number must be between %lld and %lld",
+							(long long) (first_block),
 							(long long) (nblocks - 1))));
 	}
 
-- 
2.39.2

From 8fd2a0611a13ca40fe7d4c28a27257ba0f7e2a7a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <cedric.villem...@data-bene.io>
Date: Wed, 3 Jan 2024 19:08:37 +0100
Subject: [PATCH 2/3] Allow pg_prewarm to use new smgrprefetch range

This commit replaces BufferPrefetch with smgrprefetch (with range).
It reduces the number of instructions to prefetch data.

I also defined a PREWARM_PREFETCH_RANGE which is by default the size of
a segment so there is a chance to interrupt the prefetching (on very
large table it might be desirable).
---
 contrib/pg_prewarm/pg_prewarm.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 1de1b39482..aea07927db 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -25,6 +25,8 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
+#define PREWARM_PREFETCH_RANGE	RELSEG_SIZE
+
 PG_MODULE_MAGIC;
 
 PG_FUNCTION_INFO_V1(pg_prewarm);
@@ -156,11 +158,21 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 * no practical way to do that at present without a gross modularity
 		 * violation, so we just do this.
 		 */
-		for (block = first_block; block <= last_block; ++block)
+		for (block = first_block; block <= last_block;
+			 block += PREWARM_PREFETCH_RANGE)
 		{
+			int seek = Min(PREWARM_PREFETCH_RANGE, (last_block - block + 1));
+
+			/*
+			 * if handling a multi-TB relation, we need a way to interrupt the
+			 * prefetching: smgrprefetch (mdprefetch) will loop on all segments
+			 * without interruption so we use a range and keep the following
+			 * CHECK in place
+			 */
 			CHECK_FOR_INTERRUPTS();
-			PrefetchBuffer(rel, forkNumber, block);
-			++blocks_done;
+
+			smgrprefetch(RelationGetSmgr(rel), forkNumber, block, seek);
+			blocks_done += seek;
 		}
 #else
 		ereport(ERROR,
-- 
2.39.2

From cb2fa6f0245b28a1d3f83ac901eb0d8698f22dd3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?= <cedric.villem...@data-bene.io>
Date: Wed, 3 Jan 2024 20:16:37 +0100
Subject: [PATCH 3/3] Allow pg_prewarm to use new smgrreadv

This commit replaces smgrread with smgrreadv (with range).

I also defined a PREWARM_READ_RANGE, 8 by default, which is the
number of buffers to read in one call.

I just discovered pg_readv so I hope the initialization I wrote is correct.

It's not really sure there is a huge win because all blocks are sequential
and readv should merge them but not use vectorization (if I understood
correctly).
---
 contrib/pg_prewarm/pg_prewarm.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index aea07927db..e11694d707 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -26,6 +26,7 @@
 #include "utils/rel.h"
 
 #define PREWARM_PREFETCH_RANGE	RELSEG_SIZE
+#define PREWARM_READ_RANGE	8
 
 PG_MODULE_MAGIC;
 
@@ -38,7 +39,7 @@ typedef enum
 	PREWARM_BUFFER,
 } PrewarmType;
 
-static PGIOAlignedBlock blockbuffer;
+static PGIOAlignedBlock blockbuffers[PREWARM_READ_RANGE];
 
 /*
  * pg_prewarm(regclass, mode text, fork text,
@@ -187,11 +188,19 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 * buffers.  This is more portable than prefetch mode (it works
 		 * everywhere) and is synchronous.
 		 */
-		for (block = first_block; block <= last_block; ++block)
+		char	*buffers[PREWARM_READ_RANGE];
+		for (int i=0; i < PREWARM_READ_RANGE; i++)
+			buffers[i] = blockbuffers[i].data;
+		for (block = first_block; block <= last_block;
+			 block += PREWARM_READ_RANGE)
 		{
+			int seek = Min(PREWARM_READ_RANGE, (last_block - block + 1));
+
 			CHECK_FOR_INTERRUPTS();
-			smgrread(RelationGetSmgr(rel), forkNumber, block, blockbuffer.data);
-			++blocks_done;
+
+			smgrreadv(RelationGetSmgr(rel), forkNumber, block, (void *) buffers,
+					  seek);
+			blocks_done += seek;
 		}
 	}
 	else if (ptype == PREWARM_BUFFER)
-- 
2.39.2

Reply via email to