Hi,

Here is an experimental patch for read_stream.c.  The basic idea is
that when read_stream_next_buffer() gives you a page P1, it should
also tell the CPU to prefetch the  header of the next page P2, and so
on.  However, I recognise that its lack of timing control may be a
fundamental flaw (see below).  It just ... happens to work sometimes.
Can we do better?

One thought is that this might really belong in the AM, along with
David Rowley's work on intra-page memory prefetching[1].  Stepping
back a bit to explain why I wonder about that...

What I observe is that sometimes the attached patch makes a wide
variety of well-cached table scans measurably faster, unscientifically
something like 10-15%ish.  Other times it doesn't, but it doesn't
hurt.

Which ones and when on which CPUs, I'm not yet sure.  One observation
is that if you naively pg_prewarm an empty buffer pool the table is
contiguous in memory and the hardware prefetcher[2] can perhaps detect
a sequential memory scan (?), but that's not like real life: if you
run a system for a while its buffer pool turns into scrambled eggs,
and then I guess the hardware prefetcher can't predict much (I dunno,
we aren't passing around addresses for it to speculate about, just
buffer IDs, but maybe it's not beyond the realms of array index
detection (?), I dunno).  I don't know if that is a factor, and I
don't know enough about CPUs to have specific ideas yet.

The problem with all this is that it's obviously hard to reason about
timing down here in lower level code with our tuple-at-a-time-volcano
executor design.  I've tangled with variants of this stuff before for
hash joins[3] (uncommitted, sadly), and that was actually really quite
successful!, but that tried to tackle the timing problem head on: it
created its own batching opportunities, instead of letting the program
counter escape into the wilderness.  Part of the reason I got stuck
with that project is that I started wanting general batch mode support
and batch mode tuple-lifetime instead of localised hacks, and then the
quicksand got me.

Concretely, if the caller does so much work with P1 that by the time
it fetches P2, P2's header has fallen out of [some level of] cache,
then it won't help [at that level at least].  In theory it has also
committed the crime of cache pollution in that case, ie possibly
kicked something useful out of [some level of] cache.  It is hard for
read_stream.c to know what the caller is doing between calls, which
might include running up and down a volcano 100 times.  Perhaps that
means the whole problem should be punted to the caller -- if heap
scans would like to prefetch the next page's memory, then they should
simply call read_stream_next_buffer() slightly before they finish
processing P1, so they can prefetch whatever they want themselves, and
micro-manage the timing better.  Or some more complex reordering of
work.

That leads to the idea that it might somehow actually fit in better
with David's earlier work; can we treat the inter-page hop as just one
of the many hops in the general stream of memory accesses that he was
already hacking on for prefetching fragments of heap pages?  Or
something.  Maybe the attached always leaves the header in at least L2
even in the worst real cases and so it's worth considering!  I dunno.
But I couldn't look at a > 10% performance increase from 2 lines of
code and not share, hence this hand-wavy nerd-snipy email.

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
[2] 
https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp18/cse502/slides/13-prefetch.pdf
[3] 
https://www.postgresql.org/message-id/flat/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
From 2ffbc634839c94a7da1bae9ab3976937f46fb874 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH 1/2] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC/Clang and MSVC builtins.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  1 +
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 6 files changed, 72 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a9..befd7c3b316 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index cfbd2a096f4..9d9ed38fe57 100755
--- a/configure
+++ b/configure
@@ -15546,6 +15546,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 67e738d92b1..e991e56613c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 5acf083ce3c..eecd8aa6f45 100644
--- a/meson.build
+++ b/meson.build
@@ -1707,6 +1707,7 @@ builtins = [
   'constant_p',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346cd..1cb191b02a4 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -425,6 +425,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f8d3e3b6b84..6f0b6a51417 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -546,6 +546,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
-- 
2.44.0

From 0f1a87954e27cd6e59e3ef45b610677b13a3985b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 5 Apr 2024 15:06:32 +1300
Subject: [PATCH 2/2] Prefetch page header memory when streaming relations.

read_stream.c can always see at least one page ahead of the one the
caller is accessing.  Take the opportunity to prefetch the cache line
that holds the next page's header.  For some scans, that can generate a
decent speedup, though real world results will depend on how much work
the CPU actually does before it gets around to accessing the next page.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..2e45dcdcd4b 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -617,7 +617,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 										stream->advice_enabled ?
 										READ_BUFFERS_ISSUE_ADVICE : 0)))
 			{
-				/* Fast return. */
+				/* Predict caller will soon access next page's header. */
+				pg_prefetch_mem(BufferGetPage(stream->buffers[oldest_buffer_index]));
 				return buffer;
 			}
 
@@ -748,6 +749,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	/* Predict caller will soon access next page's header. */
+	if (stream->pinned_buffers > 0)
+		pg_prefetch_mem(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
+
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
 	if (stream->ios_in_progress == 0 &&
-- 
2.44.0

Reply via email to