I've always thought that GCC intrinsics for software-based memory prefetching are a very sharp tool. While it's really hard to use GCC's __builtin_prefetch() effectively (I've tried enough times to know), I suspected that there'd end up being a handful of cases where using it presented a clear and worthwhile win. Hash joins seemed to me a likely candidate, because memory latency is a real killer there. However, I stumbled upon another case where prefetching pays clear dividends while actually being quite simple: sequentially reading an already-sorted memtuples array (an array of SortTuple structs) during some kind of post-processing.
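For anyone unfamiliar with the intrinsic: besides the address, it takes two compile-time-constant hints, a read/write flag (0 or 1) and a temporal locality value from 0 ("no reuse expected") to 3 ("keep in all cache levels", the default). A minimal sketch of those knobs follows; the function name touch_soon is purely illustrative and comes from nowhere in the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: demonstrate __builtin_prefetch()'s two hint
 * arguments, with a portable no-op fallback for other compilers. */
static void
touch_soon(const int *addr)
{
#if defined(__GNUC__)
	__builtin_prefetch(addr, 0, 0);	/* read access, low temporal locality */
	__builtin_prefetch(addr, 0, 3);	/* read access, keep in all cache levels */
#else
	(void) addr;					/* still evaluate addr; no prefetch */
#endif
}
```

Note that the hints never affect correctness, only (maybe) performance, which is part of what makes the intrinsic so hard to evaluate.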
This comes up for many common and important cases, including regular sort nodes (heaptuple sorts), datumsorts with pass-by-reference types (e.g. ordered set aggregates), and B-Tree index builds, where the sequential fetching occurs as actual B-Tree leaf pages are built from the array.

Patch
=====

The attached patch adds a portable Postgres wrapper for the GCC intrinsic. It also adds a client within tuplesort.c -- a common function that seems like a good generic choke point. I can get a speed-up of 6% - 9% for all of these cases by prefetching a few SortTuples ahead, for the "tuple proper" (in-memory sorts only). ISTM that this alone makes the patch worthwhile, but we should still consider the break-down of costs in each of these operations: the cost of sequentially reading and processing SortTuples is all that the patch touches, and only about 3 real lines of code are added to tuplesort.c.

Obviously, the dominant costs within each of these queries/utility commands are the sort proper and, to a lesser extent, the heap scan preceding it. I am not targeting those at all -- I'm only targeting the tertiary cost of sequentially scanning the memtuples array, by hiding memory latency -- and yet I end up with a notable overall speed-up. It's only really fair to look at this final sequential fetching cost in isolation, and the reduction in that cost in isolation seems to be huge. I tried out the proprietary Intel VTune Amplifier utility with this case, and it indicated that CPU time spent in calls to _bt_buildadd() was less than 1/8 of the time spent in a comparable test of the master branch, with no other changes that I can attribute to this patch (i.e. there is a bit of noise). You can also get some hint of the size of the improvement by reviewing "trace_sort = on" log output, and the reduction in time spent after "performsort" is done but before the sort is shut down.
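To make the shape of the tuplesort.c change concrete, here is a standalone sketch of the same pattern, under simplifying assumptions: the SortTuple struct below is a cut-down stand-in (the real one lives in tuplesort.c), consume_sorted is a made-up consumer rather than any function in the patch, and the prefetch distance of 3 is simply the value that tested well on my hardware:

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for tuplesort.c's SortTuple; illustrative only */
typedef struct
{
	void	   *tuple;			/* the "tuple proper", or NULL */
	int			datum1;			/* stand-in for the leading key */
} SortTuple;

/* Portable wrapper, as in the patch: read access, low temporal locality */
#if defined(__GNUC__)
#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr) ((void) (addr))
#endif

/*
 * Sequentially consume an already-sorted array, hinting the "tuple
 * proper" a few elements ahead so that the cache miss overlaps with
 * the processing of the current element.
 */
static long
consume_sorted(const SortTuple *memtuples, int memtupcount)
{
	long		total = 0;

	for (int current = 0; current < memtupcount; current++)
	{
		/* Hint the tuple three places ahead, if there is one */
		if (memtuples[current].tuple != NULL &&
			current + 3 < memtupcount)
			pg_rfetch(memtuples[current + 3].tuple);

		total += memtuples[current].datum1; /* "process" the tuple */
	}
	return total;
}
```

The NULL test is the same one the patch uses to skip pass-by-value datum sorts, where there is no tuple proper to fetch.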
Tuplestores
----------------

tuplestore.c has a similar optimization added, on the theory that it should match tuplesort.c wherever possible, and because it seems to help a lot there too. For example, queries that perform a big CTE Scan seem to benefit to about the same degree (although, again, only when the array is fully in memory).

Portability concerns
===============

I started writing this patch with an intuition that prefetching would help here. I arrived at the version posted by frobbing the code based on feedback from Postgres running on my laptop: I iteratively tweaked the number of tuples to prefetch ahead, and the value of __builtin_prefetch()'s temporal locality argument, which helped a little for a number of test cases. If that doesn't inspire much confidence in this patch, then good -- that's the idea. Obviously, if we are going to commit this, there needs to be testing on a reasonably representative selection of platforms. On the other hand, I couldn't find a case that regressed, and I am encouraged by how effectively this helps the post-processing of SortTuples; I imagine a platform would have to have very different performance characteristics for this to be the wrong thing there. However, it remains possible that some case has been regressed, since the prefetch is added at a very generic point within tuplesort.c and tuplestore.c.

-- 
Peter Geoghegan
From 3c44ec9a9c68da3408716e06d0e822acfc68a746 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH] Prefetch from memtuples array in tuplesort

Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sorting intense
operations considerably.  For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(i.e. The operation following actually "performing" the sort, but
before a tuplesort_end() call is made as a B-Tree spool is destroyed.)
Similarly, ordered set aggregates (all cases except the datumsort case
with a pass-by-value type), and regular heap tuplesorts benefit to
about the same degree.  The optimization is only used when sorts fit
in memory, though.

Also, prefetch a few places ahead within the analogous "fetching"
point in tuplestore.c.  This appears to offer similar benefits in
certain cases.  For example, queries involving large common table
expressions significantly benefit.
---
 config/c-compiler.m4                | 17 +++++++++++++++++
 configure                           | 30 ++++++++++++++++++++++++++++++
 configure.in                        |  1 +
 src/backend/utils/sort/tuplesort.c  | 17 +++++++++++++++++
 src/backend/utils/sort/tuplestore.c | 11 +++++++++++
 src/include/c.h                     | 19 +++++++++++++++++++
 src/include/pg_config.h.in          |  3 +++
 src/include/pg_config.h.win32       |  3 +++
 8 files changed, 101 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 050bfa5..5776201 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -287,6 +287,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 77048e8..f14e65d 100755
--- a/configure
+++ b/configure
@@ -11248,6 +11248,36 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index 5724a4d..2132d69 100644
--- a/configure.in
+++ b/configure.in
@@ -1318,6 +1318,7 @@ PGAC_C_TYPES_COMPATIBLE
 PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..b999dbf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1662,6 +1662,23 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 		{
 			if (state->current < state->memtupcount)
 			{
+				/*
+				 * Perform memory prefetch of "tuple proper" of the
+				 * SortTuple that's three places ahead of current
+				 * (which is returned to caller).  Testing shows that
+				 * this significantly boosts the performance for
+				 * TSS_SORTEDINMEM "forward" callers by hiding memory
+				 * latency behind their processing of returned tuples.
+				 *
+				 * Don't do this for pass-by-value datum sorts;  even
+				 * though hinting a NULL address does not affect
+				 * correctness, it would have a noticeable overhead
+				 * here.
+				 */
+				if (stup->tuple != NULL &&
+					state->current + 3 < state->memtupcount)
+					pg_rfetch(state->memtuples[state->current + 3].tuple);
+
 				*stup = state->memtuples[state->current++];
 				return true;
 			}
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 627b281..49afb24 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -895,6 +895,17 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				return NULL;
 			if (readptr->current < state->memtupcount)
 			{
+				/*
+				 * Perform memory prefetch of tuple that's three places
+				 * ahead of current (which is returned to caller).
+				 * Testing shows that this significantly boosts the
+				 * performance for TSS_INMEM "forward" callers by
+				 * hiding memory latency behind their processing of
+				 * returned tuples.
+				 */
+				if (readptr->current + 3 < state->memtupcount)
+					pg_rfetch(state->memtuples[readptr->current + 3]);
+
 				/* We have another tuple, so return it */
 				return state->memtuples[readptr->current++];
 			}
diff --git a/src/include/c.h b/src/include/c.h
index 92c5202..cab7faf 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -919,6 +919,25 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_rfetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup using pg_rfetch();  any use of the
+ * intrinsic should be carefully tested.  Also note that it's okay to pass it
+ * an invalid or NULL address, although it's best avoided.
+ *
+ * Follow the example of GCC's __builtin_prefetch() itself, and still
+ * evaluate "addr" on platforms without support.
+ */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
+#else
+#define pg_rfetch(addr) ((void) (addr))
+#endif
 
 /*
  * Function inlining support -- Allow modules to define functions that may be
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 5688f75..6aad69b 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 22bbb91..3e729c5 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -523,6 +523,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
-- 
1.9.1
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers