As part of the AIO work [1], Andres mentioned to me that he found that
prefetching tuple memory during HOT pruning showed significant wins.
I'm not proposing anything to improve HOT pruning here, but as a segue
to get the prefetching infrastructure in so that there are fewer AIO
patches, I'm proposing we prefetch the next tuple during sequential
scans while in page mode.

It turns out the gains are pretty good when we apply this:

-- table with 4 bytes of user columns
create table t as select a from generate_series(1,10000000)a;
vacuum freeze t;
select pg_prewarm('t');

Master @ a9f8ca600
# select * from t where a = 0;
Time: 355.001 ms
Time: 354.573 ms
Time: 354.490 ms
Time: 354.556 ms
Time: 354.335 ms

Master + 0001 + 0003:
# select * from t where a = 0;
Time: 328.578 ms
Time: 329.387 ms
Time: 329.349 ms
Time: 329.704 ms
Time: 328.225 ms (avg ~7.7% faster)

-- table with 64 bytes of user columns
create table t2 as
select a,a a2,a a3,a a4,a a5,a a6,a a7,a a8,a a9,a a10,a a11,a a12,
       a a13,a a14,a a15,a a16
from generate_series(1,10000000)a;
vacuum freeze t2;
select pg_prewarm('t2');

Master:
# select * from t2 where a = 0;
Time: 501.725 ms
Time: 501.815 ms
Time: 503.225 ms
Time: 501.242 ms
Time: 502.394 ms

Master + 0001 + 0003:
# select * from t2 where a = 0;
Time: 412.076 ms
Time: 410.669 ms
Time: 410.490 ms
Time: 409.782 ms
Time: 410.843 ms (avg ~22% faster)

This was tested on an AMD 3990x CPU.  I imagine the CPU matters quite
a bit here.  It would be interesting to see whether the same or
similar gains can be seen on some modern Intel chip too.

I believe Thomas wrote the 0001 patch (same as the patch in [2]?).  I
only quickly put together the 0003 patch.

I wondered if we might want to add a macro to 0001 that indicates
whether pg_prefetch_mem() expands to anything, then use that to
#ifdef out the code I added to heapam.c.  Although, perhaps most
compilers will be able to optimise away the extra instructions that
figure out the address of the next tuple when the prefetch is a
no-op.

My tests above are likely the best case for this.  It seems plausible
to me that a much more complex plan, one that found a reasonable
number of tuples and did something with them, wouldn't see the same
sort of gains.  It also does not seem impossible that the prefetch
just evicts a cache line that's useful to some other exec node, or
that the prefetched tuple gets flushed out of the cache by the time
we get around to fetching the next tuple from the scan again, due to
whatever other node processing has occurred since the seq scan was
last called.  I imagine such effects would be indistinguishable from
noise, but I've not tested.

I also tried prefetching 2 tuples ahead.  It didn't help any further
than prefetching just the next tuple.

I'll add this to the November CF.

David

[1] 
https://www.postgresql.org/message-id/flat/20210223100344.llw5an2akleng...@alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/CA%2BhUKG%2Bpi63ZbcZkYK3XB1pfN%3DkuaDaeV0Ha9E%2BX_p6TTbKBYw%40mail.gmail.com
From 2fd10f1266550f26f4395de080bcdcf89b6859b6 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH 1/3] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 000b075312..582a47501c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 3966368b8d..c4685b8a1e 100755
--- a/configure
+++ b/configure
@@ -15988,6 +15988,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index f76b7ee31f..2d4938d43d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1802,6 +1802,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index bfacbdc0af..9c35637826 100644
--- a/meson.build
+++ b/meson.build
@@ -1587,10 +1587,11 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index d70ed84ac5..26a1586dc3 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -361,6 +361,14 @@ typedef void (*pg_funcptr_t) (void);
  */
 #define FLEXIBLE_ARRAY_MEMBER  /* empty */
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *                             Section 2:      bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c5a80b829e..07a661e288 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index c2acb58df0..95de91890e 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
                HAVE_BACKTRACE_SYMBOLS     => undef,
                HAVE_BIO_GET_DATA          => undef,
                HAVE_BIO_METH_NEW          => undef,
+               HAVE__BUILTIN_PREFETCH     => undef,
                HAVE_COMPUTED_GOTO         => undef,
                HAVE_COPYFILE              => undef,
                HAVE_COPYFILE_H            => undef,
-- 
2.34.1

From 8459bc4bcdf0403f8c9513dd4d1fed0840acafc1 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Mon, 31 Oct 2022 10:05:12 +1300
Subject: [PATCH 3/3] Prefetch tuple memory during forward seqscans

---
 src/backend/access/heap/heapam.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 12be87efed..e8f1fc2d71 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1025,6 +1025,17 @@ heapgettup_pagemode(HeapScanDesc scan,
                        tuple->t_len = ItemIdGetLength(lpp);
                        ItemPointerSet(&(tuple->t_self), page, lineoff);
 
+                       /*
+                        * Prefetching the memory for the next tuple has shown to improve
+                        * performance on certain hardware.
+                        */
+                       if (!backward && linesleft > 1)
+                       {
+                               lineoff = scan->rs_vistuples[lineindex + 1];
+                               lpp = PageGetItemId(dp, lineoff);
+                               pg_prefetch_mem(PageGetItem((Page) dp, lpp));
+                       }
+
                        /*
                         * if current tuple qualifies, return it.
                         */
-- 
2.34.1
