Hi,

Here is an experimental patch for read_stream.c. The basic idea is that when read_stream_next_buffer() gives you a page P1, it should also tell the CPU to prefetch the header of the next page P2, and so on. However, I recognise that its lack of timing control may be a fundamental flaw (see below). It just... happens to work sometimes. Can we do better?
One thought is that this might really belong in the AM, along with David Rowley's work on intra-page memory prefetching [1]. Stepping back a bit to explain why I wonder about that...

What I observe is that sometimes the attached patch makes a wide variety of well-cached table scans measurably faster -- unscientifically, something like 10-15%. Other times it doesn't help, but it doesn't hurt either. Which scans, when, and on which CPUs, I'm not yet sure. One observation is that if you naively pg_prewarm an empty buffer pool, the table is contiguous in memory and the hardware prefetcher [2] can perhaps detect a sequential memory scan (?), but that's not like real life: if you run a system for a while, its buffer pool turns into scrambled eggs, and then I guess the hardware prefetcher can't predict much. (I dunno -- we aren't passing around addresses for it to speculate about, just buffer IDs, but maybe it's not beyond the realms of array index detection?) I don't know if that is a factor, and I don't know enough about CPUs to have specific ideas yet.

The problem with all this is that it's obviously hard to reason about timing down here in lower level code, given our tuple-at-a-time volcano executor design. I've tangled with variants of this stuff before for hash joins [3] (uncommitted, sadly). That was actually quite successful, but it tackled the timing problem head on: it created its own batching opportunities, instead of letting the program counter escape into the wilderness. Part of the reason I got stuck with that project is that I started wanting general batch-mode support and batch-mode tuple lifetimes instead of localised hacks, and then the quicksand got me.

Concretely, if the caller does so much work with P1 that by the time it fetches P2, P2's header has fallen out of [some level of] cache, then the prefetch won't help [at that level, at least].
In that case it has in theory also committed the crime of cache pollution, ie possibly kicked something useful out of [some level of] cache. It is hard for read_stream.c to know what the caller is doing between calls, which might include running up and down a volcano 100 times.

Perhaps that means the whole problem should be punted to the caller: if heap scans would like to prefetch the next page's memory, then they should simply call read_stream_next_buffer() slightly before they finish processing P1, so they can prefetch whatever they want themselves and micro-manage the timing better. Or some more complex reordering of work. That leads to the idea that it might actually fit in better with David's earlier work: can we treat the inter-page hop as just one of the many hops in the general stream of memory accesses that he was already hacking on for prefetching fragments of heap pages? Or something. Maybe the attached always leaves the header in at least L2 even in the worst real cases, and so it's worth considering. I dunno. But I couldn't look at a > 10% performance increase from 2 lines of code and not share, hence this hand-wavy, nerd-snipy email.

[1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com
[2] https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp18/cse502/slides/13-prefetch.pdf
[3] https://www.postgresql.org/message-id/flat/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
From 2ffbc634839c94a7da1bae9ab3976937f46fb874 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH 1/2] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC/Clang and MSVC builtins.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  1 +
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 6 files changed, 72 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 10f8c7bd0a9..befd7c3b316 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+	$1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index cfbd2a096f4..9d9ed38fe57 100755
--- a/configure
+++ b/configure
@@ -15546,6 +15546,46 @@ _ACEOF
 fi
 
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+	__builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 67e738d92b1..e991e56613c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 5acf083ce3c..eecd8aa6f45 100644
--- a/meson.build
+++ b/meson.build
@@ -1707,6 +1707,7 @@ builtins = [
   'constant_p',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
diff --git a/src/include/c.h b/src/include/c.h
index dc1841346cd..1cb191b02a4 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -425,6 +425,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER 1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *		Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f8d3e3b6b84..6f0b6a51417 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -546,6 +546,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
-- 
2.44.0
From 0f1a87954e27cd6e59e3ef45b610677b13a3985b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 5 Apr 2024 15:06:32 +1300
Subject: [PATCH 2/2] Prefetch page header memory when streaming relations.

read_stream.c can always see at least one page ahead of the one the
caller is accessing.  Take the opportunity to prefetch the cache line
that holds the next page's header.  For some scans, that can generate a
decent speedup, though real world results will depend on how much work
the CPU actually does before it gets around to accessing the next page.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index f54dacdd914..2e45dcdcd4b 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -617,7 +617,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
							  stream->advice_enabled ?
							  READ_BUFFERS_ISSUE_ADVICE : 0)))
 		{
-			/* Fast return. */
+			/* Predict caller will soon access next page's header. */
+			pg_prefetch_mem(BufferGetPage(stream->buffers[oldest_buffer_index]));
 			return buffer;
 		}
 
@@ -748,6 +749,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	/* Predict caller will soon access next page's header. */
+	if (stream->pinned_buffers > 0)
+		pg_prefetch_mem(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
+
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
 	if (stream->ios_in_progress == 0 &&
-- 
2.44.0