As part of the AIO work [1], Andres mentioned to me that he found that prefetching tuple memory during HOT pruning showed significant wins. I'm not proposing anything to improve HOT pruning here, but as a segue to get the prefetching infrastructure in (and so reduce the number of AIO patches), I'm proposing we prefetch the memory of the next tuple during forward sequential scans when in page mode.
It turns out the gains are pretty good when we apply this:

-- table with 4 bytes of user columns
create table t as select a from generate_series(1,10000000)a;
vacuum freeze t;
select pg_prewarm('t');

Master @ a9f8ca600

# select * from t where a = 0;
Time: 355.001 ms
Time: 354.573 ms
Time: 354.490 ms
Time: 354.556 ms
Time: 354.335 ms

Master + 0001 + 0003:

# select * from t where a = 0;
Time: 328.578 ms
Time: 329.387 ms
Time: 329.349 ms
Time: 329.704 ms
Time: 328.225 ms

(avg ~7.7% faster)

-- table with 64 bytes of user columns
create table t2 as select a,a a2,a a3,a a4,a a5,a a6,a a7,a a8,a a9,a a10,a a11,a a12,a a13,a a14,a a15,a a16 from generate_series(1,10000000)a;
vacuum freeze t2;
select pg_prewarm('t2');

Master:

# select * from t2 where a = 0;
Time: 501.725 ms
Time: 501.815 ms
Time: 503.225 ms
Time: 501.242 ms
Time: 502.394 ms

Master + 0001 + 0003:

# select * from t2 where a = 0;
Time: 412.076 ms
Time: 410.669 ms
Time: 410.490 ms
Time: 409.782 ms
Time: 410.843 ms

(avg ~22% faster)

This was tested on an AMD 3990x CPU. I imagine the CPU matters quite a bit here. It would be interesting to see if the same or similar gains can be seen on a modern Intel chip too.

I believe Thomas wrote the 0001 patch (same as the patch in [2]?). I only quickly put together the 0003 patch. I wondered if we might want to add a macro to 0001 that indicates whether pg_prefetch_mem() expands to anything, and then use that to #ifdef out the code I added to heapam.c. Although, perhaps most compilers will be able to optimise away the extra code that figures out the address of the next tuple.

My tests above are likely the best case for this. It seems plausible that with a much more complex plan that found a reasonable number of tuples and did something with them, we wouldn't see the same sort of gains.
It also does not seem impossible that the prefetch simply evicts a cache line that is useful to some other executor node, or that the prefetched tuple gets flushed out of the cache by the time we get around to fetching the next tuple from the scan again, due to the other node processing that's occurred since the seq scan was last called. I imagine such things would be indistinguishable from noise, but I've not tested.

I also tried prefetching 2 tuples ahead. It didn't help any further than prefetching 1 tuple ahead.

I'll add this to the November CF.

David

[1] https://www.postgresql.org/message-id/flat/20210223100344.llw5an2akleng...@alap3.anarazel.de
[2] https://www.postgresql.org/message-id/CA%2BhUKG%2Bpi63ZbcZkYK3XB1pfN%3DkuaDaeV0Ha9E%2BX_p6TTbKBYw%40mail.gmail.com
From 2fd10f1266550f26f4395de080bcdcf89b6859b6 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH 1/3] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 000b075312..582a47501c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+  $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
 
diff --git a/configure b/configure
index 3966368b8d..c4685b8a1e 100755
--- a/configure
+++ b/configure
@@ -15988,6 +15988,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+  __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index f76b7ee31f..2d4938d43d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1802,6 +1802,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index bfacbdc0af..9c35637826 100644
--- a/meson.build
+++ b/meson.build
@@ -1587,10 +1587,11 @@
 builtins = [
   'bswap32',
   'bswap64',
   'clz',
-  'ctz',
   'constant_p',
+  'ctz',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
diff --git a/src/include/c.h b/src/include/c.h
index d70ed84ac5..26a1586dc3 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -361,6 +361,14 @@ typedef void (*pg_funcptr_t) (void);
  */
 #define FLEXIBLE_ARRAY_MEMBER	/* empty */
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c5a80b829e..07a661e288 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -559,6 +559,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index c2acb58df0..95de91890e 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -227,6 +227,7 @@ sub GenerateFiles
 		HAVE_BACKTRACE_SYMBOLS => undef,
 		HAVE_BIO_GET_DATA => undef,
 		HAVE_BIO_METH_NEW => undef,
+		HAVE__BUILTIN_PREFETCH => undef,
 		HAVE_COMPUTED_GOTO => undef,
 		HAVE_COPYFILE => undef,
 		HAVE_COPYFILE_H => undef,
-- 
2.34.1
From 8459bc4bcdf0403f8c9513dd4d1fed0840acafc1 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrow...@gmail.com>
Date: Mon, 31 Oct 2022 10:05:12 +1300
Subject: [PATCH 3/3] Prefetch tuple memory during forward seqscans

---
 src/backend/access/heap/heapam.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 12be87efed..e8f1fc2d71 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1025,6 +1025,17 @@ heapgettup_pagemode(HeapScanDesc scan,
 		tuple->t_len = ItemIdGetLength(lpp);
 		ItemPointerSet(&(tuple->t_self), page, lineoff);
 
+		/*
+		 * Prefetching the memory for the next tuple has been shown to
+		 * improve performance on certain hardware.
+		 */
+		if (!backward && linesleft > 1)
+		{
+			lineoff = scan->rs_vistuples[lineindex + 1];
+			lpp = PageGetItemId(dp, lineoff);
+			pg_prefetch_mem(PageGetItem((Page) dp, lpp));
+		}
+
 		/*
 		 * if current tuple qualifies, return it.
 		 */
-- 
2.34.1