Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Lukas Fittl Thu, 12 Feb 2026 08:06:32 -0800

On Wed, Feb 4, 2026 at 2:09 AM David Geier <[email protected]> wrote:
> On 03.02.2026 18:44, Andres Freund wrote:
> > Whereas it'll not make sense for anything that needs wall clock times - 
> > which
> > imo makes a "clock_source" GUC misnamed.  Maybe "clock_source_timing" or 
> > such?
>
> Makes sense. clock_source_timing works for me, or maybe easier to read
> would be timing_clock_source. But doesn't matter much.


I've gone with "timing_clock_source" for now, that seems best to me so far.

See attached v6, which has the following changes:

Adds a new commit (0002) that:

- Unifies the Windows QueryPerformanceCounter() implementation with
the TSC implementation, and introduces the pg_ticks_to_ns function and
the ticks_per_ns_scaled conversion ratio in this commit.
- I realized that we were unnecessarily calling
QueryPerformanceFrequency on Windows every time we got the ticks, and
this is the same problem we're solving for the TSC frequency.
- This also allows us to independently test if the overhead of
overflow handling of pg_ticks_to_ns is a problem when clock_gettime is
used (since we're always exceeding max_ticks_no_overflow -- see end of
email), and makes the code read much better, I think.
- The one downside is that this means every program that wants to use
INSTR_* macros now has to call pg_initialize_timing() first. In 0003
we can then rely on the GUC mechanism to do this in the backend, as
before.

For the main TSC commit (now 0003):

- Enable the use of TSC on Windows when set explicitly via the GUC,
since I couldn't find a good reason why not - but note I have not done
any manual testing on Windows yet.
- Refactoring to improve readability and better split TSC logic from
general timing clock source logic

In regards to the GUC (part of 0003):

- Renamed to "timing_clock_source", with three settings:
  - "auto" that will use the TSC clock source if we're on Linux and
Linux itself uses the TSC clock source
  - "system" will force use of the system APIs, i.e. clock_gettime()
or QueryPerformanceCounter() -- this was named "off" before, but I
think "system" is more clear
  - "tsc" will force the use of RDTSC/RDTSCP on x86-64, and will fail
if it is not available. Not a possible setting on other architectures.
-- this was named "rdtsc" before, but I think "tsc" is better, since
we use a mix of RDTSC and RDTSCP
- Resolves the ubsan issue in CI, which was caused by a missing
addition to "config_group_names" for the new "Resource / Time" GUC
group. I wonder if we should make this "Resource / Other" instead
though? (it seems unlikely we'll have another GUC for time
specifically?)
- Implements a show hook for the GUC that will show the current value
in parenthesis with auto is selected. Is this sufficient to address
the use case of wanting to know the current clock source?

postgres=# SHOW timing_clock_source;
 timing_clock_source
---------------------
 auto (tsc)

For the pg_test_timing change (0004):
- If available, show both RDTSC and RDTSCP timings (RDTSC indicated as
"Fast"), as well as the system clock source, to help decide whether to
enable TSC.

FWIW, I have not yet investigated expanding the use of fast timing to
other places (e.g. track_io_timing/track_wal_io_timing) as suggested.

Regarding the overhead introduced with pg_ticks_to_ns:

On master (88327092ff0), I'm getting 23.54 ns from pg_test_timing - vs
with 0002 applied, this slows to 25.74 ns. I've tried to see if the
"unlikely(..)" we added in pg_ticks_to_ns is the problem (since in the
clock_gettime() case we'd always be running into that branch due to
the size of the nanoseconds value), but no luck - I think the extra
multiplication/division itself is the problem.

Any ideas how we could do this differently?

Thanks,
Lukas

--
Lukas Fittl

From 002cea28a3ca14cb9a5f7674201ce3eb2b1ded8e Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 31 Jan 2026 08:49:46 -0800
Subject: [PATCH v6 1/4] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
 separately

Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform. But there
are cases where we want to be able to call __cpuidex as the only viable
option, specifically, when accessing a high leaf like VM Hypervisor
information (0x40000000), which __get_cpuid_count does not allow.

This will be used in an future commit to access Hypervisor information
about the TSC frequency of x86 CPUs, where available.

Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
adjust existing code to check for _MSC_VER when including intrin.h.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 configure                         | 20 ++++++++++++--------
 configure.ac                      | 30 +++++++++++++++++-------------
 meson.build                       | 10 ++++++++--
 src/port/pg_crc32c_sse42_choose.c |  4 ++--
 src/port/pg_popcount_x86.c        |  4 ++--
 5 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/configure b/configure
index a10a2c85c6a..38de88fcc50 100755
--- a/configure
+++ b/configure
@@ -17648,7 +17648,8 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
   fi
 fi
 
-# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
+# need __cpuidex() even if __get_cpuid_count() is available.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
 $as_echo_n "checking for __get_cpuid_count... " >&6; }
 if ${pgac_cv__get_cpuid_count+:} false; then :
@@ -17681,21 +17682,25 @@ if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
 
 $as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
 
-else
-  # __cpuidex()
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+fi
+# __cpuidex()
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
 $as_echo_n "checking for __cpuidex... " >&6; }
 if ${pgac_cv__cpuidex+:} false; then :
   $as_echo_n "(cached) " >&6
 else
   cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
-#include <intrin.h>
+#ifdef _MSC_VER
+    #include <intrin.h>
+    #else
+    #include <cpuid.h>
+    #endif
 int
 main ()
 {
 unsigned int exx[4] = {0, 0, 0, 0};
-    __cpuidex(exx, 7, 0);
+  __cpuidex(exx, 7, 0);
 
   ;
   return 0;
@@ -17711,11 +17716,10 @@ rm -f core conftest.err conftest.$ac_objext \
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
 $as_echo "$pgac_cv__cpuidex" >&6; }
-  if test x"$pgac_cv__cpuidex" = x"yes"; then
+if test x"$pgac_cv__cpuidex" = x"yes"; then
 
 $as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
 
-  fi
 fi
 
 # Check for XSAVE intrinsics
diff --git a/configure.ac b/configure.ac
index 814e64a967e..6e174cba328 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2098,7 +2098,8 @@ else
   fi
 fi
 
-# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
+# need __cpuidex() even if __get_cpuid_count() is available.
 AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
   [[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2108,18 +2109,21 @@ AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
   [pgac_cv__get_cpuid_count="no"])])
 if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
   AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
-else
-  # __cpuidex()
-  AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
-  [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
-    [[unsigned int exx[4] = {0, 0, 0, 0};
-    __cpuidex(exx, 7, 0);
-    ]])],
-    [pgac_cv__cpuidex="yes"],
-    [pgac_cv__cpuidex="no"])])
-  if test x"$pgac_cv__cpuidex" = x"yes"; then
-    AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
-  fi
+fi
+# __cpuidex()
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#ifdef _MSC_VER
+    #include <intrin.h>
+    #else
+    #include <cpuid.h>
+    #endif],
+  [[unsigned int exx[4] = {0, 0, 0, 0};
+  __cpuidex(exx, 7, 0);
+  ]])],
+  [pgac_cv__cpuidex="yes"],
+  [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+  AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
 fi
 
 # Check for XSAVE intrinsics
diff --git a/meson.build b/meson.build
index 96b3869df86..52e77b02656 100644
--- a/meson.build
+++ b/meson.build
@@ -2078,7 +2078,8 @@ elif cc.links('''
 endif
 
 
-# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
+# need __cpuidex() even if __get_cpuid_count() is available.
 if cc.links('''
     #include <cpuid.h>
     int main(int arg, char **argv)
@@ -2089,8 +2090,13 @@ if cc.links('''
     ''', name: '__get_cpuid_count',
     args: test_c_args)
   cdata.set('HAVE__GET_CPUID_COUNT', 1)
-elif cc.links('''
+endif
+if cc.links('''
+    #ifdef _MSC_VER
     #include <intrin.h>
+    #else
+    #include <cpuid.h>
+    #endif
     int main(int arg, char **argv)
     {
         unsigned int exx[4] = {0, 0, 0, 0};
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index f586476964f..7a75380b483 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,11 +20,11 @@
 
 #include "c.h"
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
 #include <cpuid.h>
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
 #include <intrin.h>
 #endif
 
diff --git a/src/port/pg_popcount_x86.c b/src/port/pg_popcount_x86.c
index 245f0167d00..f8a20766f2d 100644
--- a/src/port/pg_popcount_x86.c
+++ b/src/port/pg_popcount_x86.c
@@ -14,7 +14,7 @@
 
 #ifdef HAVE_X86_64_POPCNTQ
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
 #include <cpuid.h>
 #endif
 
@@ -22,7 +22,7 @@
 #include <immintrin.h>
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
 #include <intrin.h>
 #endif
 
-- 
2.47.1

From 0fb934a9a6246a102226eb14fa0a306127a63410 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH v6 2/4] Timing: Always perform ticks to nanosecond conversion

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, and calls
QueryPerformanceFrequency() on every INSTR_TIME_GET_* invocation, despite
the frequency being stable after program start, incurring unnecessary
overhead. It also causes a fractured implementation where macros are
defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on other platforms, introduce
a new pg_ticks_to_ns() function that gets called on all platforms.

This function relies on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to a fixed value on
POSIX systems, that effectively results in returning the internal ticks
counter as nanoseconds.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 src/backend/postmaster/postmaster.c     |  3 +
 src/bin/pg_test_timing/pg_test_timing.c |  3 +
 src/bin/pgbench/pgbench.c               |  3 +
 src/bin/psql/startup.c                  |  4 ++
 src/common/Makefile                     |  1 +
 src/common/instr_time.c                 | 89 +++++++++++++++++++++++
 src/common/meson.build                  |  1 +
 src/include/portability/instr_time.h    | 94 +++++++++++++++++--------
 8 files changed, 167 insertions(+), 31 deletions(-)
 create mode 100644 src/common/instr_time.c

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d6133bfebc6..0ee2e67a30a 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -588,6 +588,9 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitializeGUCOptions();
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	opterr = 1;
 
 	/*
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..fee2911df15 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -43,6 +43,9 @@ main(int argc, char *argv[])
 
 	handle_args(argc, argv);
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	loop_count = test_timing(test_duration);
 
 	output(loop_count);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 58735871c17..16f7790680b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7334,6 +7334,9 @@ main(int argc, char **argv)
 		initRandomState(&state[i].cs_func_rs);
 	}
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 9a397ec87b7..69d044d405d 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
 #include "help.h"
 #include "input.h"
 #include "mainloop.h"
+#include "portability/instr_time.h"
 #include "settings.h"
 
 /*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
 
 	PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	SyncVariables();
 
 	if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2c720caa509..1a2fbbe887f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
 	file_perm.o \
 	file_utils.o \
 	hashfn.o \
+	instr_time.o \
 	ip.o \
 	jsonapi.o \
 	keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..3a6f41b98e1
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,89 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ *	   Non-inline parts of the portable high-precision interval timing
+ *	 implementation
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of ticks needs to be multiplied with to end up
+ * with nanoseconds using integer math.
+ *
+ * On certain platforms (currently Windows) the ticks to nanoseconds conversion
+ * requires floating point math because:
+ *
+ * sec = ticks / frequency_hz
+ * ns  = ticks / frequency_hz * 1,000,000,000
+ * ns  = ticks * (1,000,000,000 / frequency_hz)
+ * ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU
+ * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of precision. We
+ * first scale the integer up and after the multiplication by the number
+ * of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision and is a
+ * power-of-two which allows for shifting instead of doing an integer
+ * division.
+ *
+ * On all other platforms we are using clock_gettime(), which uses nanoseconds
+ * as ticks. Hence, we set the multiplier to the precision scalar so that the
+ * division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ */
+int64		ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64		max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+static void set_ticks_per_ns(void);
+
+void
+pg_initialize_timing()
+{
+	set_ticks_per_ns();
+}
+
+#ifndef WIN32
+
+static int64
+ticks_per_ns_for_system()
+{
+	return TICKS_TO_NS_PRECISION;
+}
+
+#else							/* WIN32 */
+
+/* GetTimerFrequency returns counts per second */
+static inline double
+GetTimerFrequency(void)
+{
+	LARGE_INTEGER f;
+
+	QueryPerformanceFrequency(&f);
+	return (double) f.QuadPart;
+}
+
+static int64
+ticks_per_ns_for_system()
+{
+	return INT64CONST(1000000000) * TICKS_TO_NS_PRECISION / GetTimerFrequency();
+}
+
+#endif							/* WIN32 */
+
+static void
+set_ticks_per_ns()
+{
+	ticks_per_ns_scaled = ticks_per_ns_for_system();
+	max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+}
diff --git a/src/common/meson.build b/src/common/meson.build
index b757618a9c9..042edb7473a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
   'file_perm.c',
   'file_utils.c',
   'hashfn.c',
+  'instr_time.c',
   'ip.c',
   'jsonapi.c',
   'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 490593d1825..ed6e52ef84e 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -78,11 +78,29 @@ typedef struct instr_time
 #define NS_PER_MS	INT64CONST(1000000)
 #define NS_PER_US	INT64CONST(1000)
 
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
 
-#ifndef WIN32
+/*
+ * Variables used to translate ticks to nanoseconds, initialized by
+ * pg_initialize_timing.
+ */
+extern PGDLLIMPORT int64 ticks_per_ns_scaled;
+extern PGDLLIMPORT int64 max_ticks_no_overflow;
 
+/*
+ * Initialize timing infrastructure
+ *
+ * This must be called at least once before using INSTR_TIME_SET_CURRENT* macros.
+ */
+extern void pg_initialize_timing(void);
 
-/* Use clock_gettime() */
+#ifndef WIN32
+
+/* On POSIX, use clock_gettime() for system clock source */
 
 #include <time.h>
 
@@ -106,9 +124,8 @@ typedef struct instr_time
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-/* helper for INSTR_TIME_SET_CURRENT */
 static inline instr_time
-pg_clock_gettime_ns(void)
+pg_get_ticks(void)
 {
 	instr_time	now;
 	struct timespec tmp;
@@ -119,21 +136,12 @@ pg_clock_gettime_ns(void)
 	return now;
 }
 
-#define INSTR_TIME_SET_CURRENT(t) \
-	((t) = pg_clock_gettime_ns())
-
-#define INSTR_TIME_GET_NANOSEC(t) \
-	((int64) (t).ticks)
-
-
 #else							/* WIN32 */
 
+/* On Windows, use QueryPerformanceCounter() for system clock source */
 
-/* Use QueryPerformanceCounter() */
-
-/* helper for INSTR_TIME_SET_CURRENT */
 static inline instr_time
-pg_query_performance_counter(void)
+pg_get_ticks(void)
 {
 	instr_time	now;
 	LARGE_INTEGER tmp;
@@ -144,23 +152,43 @@ pg_query_performance_counter(void)
 	return now;
 }
 
-static inline double
-GetTimerFrequency(void)
-{
-	LARGE_INTEGER f;
-
-	QueryPerformanceFrequency(&f);
-	return (double) f.QuadPart;
-}
-
-#define INSTR_TIME_SET_CURRENT(t) \
-	((t) = pg_query_performance_counter())
-
-#define INSTR_TIME_GET_NANOSEC(t) \
-	((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
-
 #endif							/* WIN32 */
 
+static inline int64_t
+pg_ticks_to_ns(int64 ticks)
+{
+	/*
+	 * Would multiplication overflow? If so perform computation in two parts.
+	 * Check overflow without actually overflowing via: a * b > max <=> a >
+	 * max / b
+	 */
+	int64		ns = 0;
+
+	if (unlikely(ticks > max_ticks_no_overflow))
+	{
+		/*
+		 * Compute how often the maximum number of ticks fits completely into
+		 * the number of elapsed ticks and convert that number into
+		 * nanoseconds. Then multiply by the count to arrive at the final
+		 * value. In a 2nd step we adjust the number of elapsed ticks and
+		 * convert the remaining ticks.
+		 */
+		int64		count = ticks / max_ticks_no_overflow;
+		int64		max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+		ns = max_ns * count;
+
+		/*
+		 * Subtract the ticks that we now already accounted for, so that they
+		 * don't get counted twice.
+		 */
+		ticks -= count * max_ticks_no_overflow;
+		Assert(ticks >= 0);
+	}
+
+	ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+	return ns;
+}
 
 /*
  * Common macros
@@ -168,12 +196,13 @@ GetTimerFrequency(void)
 
 #define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
 
-
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
 
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+#define INSTR_TIME_SET_CURRENT(t) \
+	((t) = pg_get_ticks())
 
 #define INSTR_TIME_ADD(x,y) \
 	((x).ticks += (y).ticks)
@@ -187,6 +216,9 @@ GetTimerFrequency(void)
 #define INSTR_TIME_LT(x,y) \
 	((x).ticks > (y).ticks)
 
+#define INSTR_TIME_GET_NANOSEC(t) \
+	(pg_ticks_to_ns((t).ticks))
+
 #define INSTR_TIME_GET_DOUBLE(t) \
 	((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
 
-- 
2.47.1

From b5efb122f55701e386a2cd808c30b7f7db83c708 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 12 Feb 2026 01:12:19 -0800
Subject: [PATCH v6 4/4] pg_test_timing: Also test RDTSC/RDTSCP timing and
 report time source

In passing also reduce the per-loop overhead caused by repeated divisions
in INSTR_TIME_GET_NANOSEC when the ticks variable has become very large
instead diff first and then turn it into nanosecs.

Author: David Geier <[email protected]>
Author: Andres Freund <[email protected]>
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 src/bin/pg_test_timing/pg_test_timing.c | 96 ++++++++++++++++++++-----
 src/include/portability/instr_time.h    |  9 +++
 2 files changed, 88 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index fee2911df15..6e630892b8d 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -30,7 +30,7 @@ static long long int largest_diff_count;
 
 
 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(unsigned int duration);
+static uint64 test_timing(unsigned int duration, TimingClockSourceType source, bool fast_timing);
 static void output(uint64 loop_count);
 
 int
@@ -46,10 +46,47 @@ main(int argc, char *argv[])
 	/* initialize timing infrastructure (required for INSTR_* calls) */
 	pg_initialize_timing();
 
-	loop_count = test_timing(test_duration);
-
+	/*
+	 * First, test default (non-fast) timing code. A clock source for that is
+	 * always available. Hence, we can unconditionally output the result.
+	 */
+	loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_SYSTEM, false);
 	output(loop_count);
 
+#if defined(__x86_64__)
+
+	/*
+	 * If on a supported architecture, test the RDTSC clock source. This clock
+	 * source is not always available. In that case the loop count will be 0
+	 * and we don't print.
+	 *
+	 * We first emit RDTSCP timings, which is slower, and gets used for higher
+	 * precision measurements when the TSC clock source is enabled. We emit
+	 * RDTSC second, which is used for faster timing measurements with lower
+	 * precision.
+	 */
+	printf("\n");
+	loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, false);
+	if (loop_count > 0)
+	{
+		output(loop_count);
+		printf("\n");
+
+		/* Now, emit fast timing measurements */
+		loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
+		output(loop_count);
+		printf("\n");
+
+		pg_set_timing_clock_source(TIMING_CLOCK_SOURCE_AUTO);
+		if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
+			printf(_("TSC clock source will be used by default, unless timing_clock_source is set to 'system'.\n"));
+		else
+			printf(_("TSC clock source will not be used by default, unless timing_clock_source is set to 'tsc'.\n"));
+	}
+	else
+		printf(_("TSC clock source is not usable. Likely unable to determine TSC frequency. are you running in an unsupported virtualized environment?.\n"));
+#endif
+
 	return 0;
 }
 
@@ -146,23 +183,41 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}
 
-	printf(ngettext("Testing timing overhead for %u second.\n",
-					"Testing timing overhead for %u seconds.\n",
+	printf(ngettext("Testing timing overhead for %u second.\n\n",
+					"Testing timing overhead for %u seconds.\n\n",
 					test_duration),
 		   test_duration);
 }
 
 static uint64
-test_timing(unsigned int duration)
+test_timing(unsigned int duration, TimingClockSourceType source, bool fast_timing)
 {
 	uint64		total_time;
 	int64		time_elapsed = 0;
 	uint64		loop_count = 0;
-	uint64		prev,
-				cur;
 	instr_time	start_time,
 				end_time,
-				temp;
+				prev,
+				cur;
+	char	   *time_source = NULL;
+
+	if (!pg_set_timing_clock_source(source))
+		return 0;
+
+#if defined(__x86_64__)
+	if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
+		time_source = fast_timing ? "RDTSC" : "RDTSCP";
+	else
+		time_source = PG_INSTR_SYSTEM_CLOCK_NAME;
+#else
+	time_source = PG_INSTR_SYSTEM_CLOCK_NAME;
+#endif
+	if (fast_timing)
+		printf(_("Fast clock source: %s\n"), time_source);
+	else if (source == TIMING_CLOCK_SOURCE_SYSTEM)
+		printf(_("System clock source: %s\n"), time_source);
+	else
+		printf(_("Clock source: %s\n"), time_source);
 
 	/*
 	 * Pre-zero the statistics data structures.  They're already zero by
@@ -176,8 +231,11 @@ test_timing(unsigned int duration)
 
 	total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
 
-	INSTR_TIME_SET_CURRENT(start_time);
-	cur = INSTR_TIME_GET_NANOSEC(start_time);
+	if (fast_timing)
+		INSTR_TIME_SET_CURRENT_FAST(start_time);
+	else
+		INSTR_TIME_SET_CURRENT(start_time);
+	cur = start_time;
 
 	while (time_elapsed < total_time)
 	{
@@ -185,9 +243,11 @@ test_timing(unsigned int duration)
 					bits;
 
 		prev = cur;
-		INSTR_TIME_SET_CURRENT(temp);
-		cur = INSTR_TIME_GET_NANOSEC(temp);
-		diff = cur - prev;
+		if (fast_timing)
+			INSTR_TIME_SET_CURRENT_FAST(cur);
+		else
+			INSTR_TIME_SET_CURRENT(cur);
+		diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
 
 		/* Did time go backwards? */
 		if (unlikely(diff < 0))
@@ -220,11 +280,13 @@ test_timing(unsigned int duration)
 			largest_diff_count++;
 
 		loop_count++;
-		INSTR_TIME_SUBTRACT(temp, start_time);
-		time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+		time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
 	}
 
-	INSTR_TIME_SET_CURRENT(end_time);
+	if (fast_timing)
+		INSTR_TIME_SET_CURRENT_FAST(end_time);
+	else
+		INSTR_TIME_SET_CURRENT(end_time);
 
 	INSTR_TIME_SUBTRACT(end_time, start_time);
 
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e7cc251b3d2..41a2225fe6e 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -28,6 +28,8 @@
  *
  * INSTR_TIME_SUBTRACT(x, y)		x -= y
  *
+ * INSTR_TIME_DIFF_NANOSEC(x, y)	x - y (in nanoseconds)
+ *
  * INSTR_TIME_ACCUM_DIFF(x, y, z)	x += (y - z)
  *
  * INSTR_TIME_GET_DOUBLE(t)			convert t to double (in seconds)
@@ -164,10 +166,13 @@ extern bool pg_set_timing_clock_source(TimingClockSourceType source);
  */
 #if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
 #define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC_RAW
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"clock_gettime (CLOCK_MONOTONIC_RAW)"
 #elif defined(CLOCK_MONOTONIC)
 #define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"clock_gettime (CLOCK_MONOTONIC)"
 #else
 #define PG_INSTR_SYSTEM_CLOCK	CLOCK_REALTIME
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"clock_gettime (CLOCK_REALTIME)"
 #endif
 
 static inline instr_time
@@ -186,6 +191,7 @@ pg_get_ticks_system(void)
 
 /* On Windows, use QueryPerformanceCounter() for system clock source */
 
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"QueryPerformanceCounter"
 static inline instr_time
 pg_get_ticks_system(void)
 {
@@ -289,6 +295,9 @@ pg_get_ticks(void)
 #define INSTR_TIME_SUBTRACT(x,y) \
 	((x).ticks -= (y).ticks)
 
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+	(pg_ticks_to_ns((x).ticks - (y).ticks))
+
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	((x).ticks += (y).ticks - (z).ticks)
 
-- 
2.47.1

From cad5f3351f16d26b2e12dfaa0ea1ffe7dbddd610 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 12 Feb 2026 01:09:48 -0800
Subject: [PATCH v6 3/4] Timing: Use Time-Stamp Counter (TSC) on x86-64 for
 faster measurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSC instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
in case the system clocksource is reported as "tsc". The use of the system
APIs can be enforced by setting "system", or on x86-64 architectures the
use of TSC can be enforced by explicitly setting "tsc".

Note, that we further split the use of the TSC into the RDTSC CPU instruction
which does not wait for out-of-order execution (faster, less precise)
and the RDTSCP instruction, which waits for outstanding instructions to
retire. RDTSCP is deemed to have little benefit in the typical
InstrStartNode() / InstrStopNode() use case of EXPLAIN, and can be up to
twice as slow. To separate these use cases, the new macro
INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.

Author: David Geier <[email protected]>
Author: Andres Freund <[email protected]>
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 doc/src/sgml/config.sgml                      |  55 ++++
 src/backend/executor/instrument.c             |  12 +-
 src/backend/postmaster/postmaster.c           |   3 -
 src/backend/utils/misc/guc_parameters.dat     |  11 +
 src/backend/utils/misc/guc_tables.c           |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/common/instr_time.c                       | 282 +++++++++++++++++-
 src/include/portability/instr_time.h          | 109 ++++++-
 src/include/utils/guc_hooks.h                 |   3 +
 src/include/utils/guc_tables.h                |   1 +
 10 files changed, 466 insertions(+), 25 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bc2690ce07..a23f4e11b44 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2523,6 +2523,61 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+    <sect2 id="runtime-config-resource-time">
+     <title>Timing</title>
+
+     <variablelist>
+     <varlistentry id="guc-timing-clock-source" xreflabel="timing_clock_source">
+      <term><varname>timing_clock_source</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>timing_clock_source</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Selects the method for making timing measurements using the OS or specialized CPU
+        instructions. Possible values are:
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>auto</literal> (automatically chooses TSC clock source for Linux-based
+            x86-64 systems that utilize "tsc" as their system clock source, otherwise uses
+            the OS system clock)
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>system</literal> (measures timing using the OS system clock)
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>tsc</literal> (measures timing using the x86-64 Time-Stamp Counter (TSC)
+            by directly executing RDTSC/RDTSCP instructions, see below)
+           </para>
+          </listitem>
+         </itemizedlist>
+         The default is <literal>auto</literal>.
+        </para>
+        <para>
+          If enabled, the TSC clock source will use the RDTSC instruction for the x86-64
+          Time-Stamp Counter (TSC) to perform certain time measurements, for example during
+          EXPLAIN ANALYZE. The RDTSC instruction has less overhead than going through the OS
+          clock source, which for an EXPLAIN ANALYZE statement will show timing closer to the
+          actual runtime when timing is off. For timings that require higher precision the
+          RDTSCP instruction is used, which avoids inaccuracies due to CPU instruction re-ordering.
+          Use of RDTSC/RDTSC is not supported on Windows or on other architectures, and is not
+          advised on systems that utilize an emulated TSC.
+        </para>
+        <para>
+          To help decide which clock source to use on an x86-64 system you can run the
+          <application>pg_test_timing</application> utility to check TSC availability, and
+          perform timing measurements.
+        </para>
+      </listitem>
+     </varlistentry>
+     </variablelist>
+    </sect2>
 
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index edab92a0ebe..ebdad31ca3b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
 void
 InstrStartNode(Instrumentation *instr)
 {
-	if (instr->need_timer &&
-		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
-		elog(ERROR, "InstrStartNode called twice in a row");
+	if (instr->need_timer)
+	{
+		if (!INSTR_TIME_IS_ZERO(instr->starttime))
+			elog(ERROR, "InstrStartNode called twice in a row");
+		else
+			INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+	}
 
 	/* save buffer usage totals at node entry, if needed */
 	if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		if (INSTR_TIME_IS_ZERO(instr->starttime))
 			elog(ERROR, "InstrStopNode called without start");
 
-		INSTR_TIME_SET_CURRENT(endtime);
+		INSTR_TIME_SET_CURRENT_FAST(endtime);
 		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
 
 		INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0ee2e67a30a..d6133bfebc6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -588,9 +588,6 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitializeGUCOptions();
 
-	/* initialize timing infrastructure (required for INSTR_* calls) */
-	pg_initialize_timing();
-
 	opterr = 1;
 
 	/*
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 271c033952e..7828c8dcbf5 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2988,6 +2988,17 @@
   assign_hook => 'assign_timezone_abbreviations',
 },
 
+{ name => 'timing_clock_source', type => 'enum', context => 'PGC_USERSET', group => 'RESOURCES_TIME',
+  short_desc => 'Controls the clock source used for collecting timing measurements.',
+  long_desc => 'This enables the use of specialized clock sources, specifically the RDTSC clock source on x86-64 systems (if available), to support timing measurements with lower overhead during EXPLAIN and other instrumentation.',
+  variable => 'timing_clock_source',
+  boot_val => 'TIMING_CLOCK_SOURCE_AUTO',
+  options => 'timing_clock_source_options',
+  check_hook => 'check_timing_clock_source',
+  assign_hook => 'assign_timing_clock_source',
+  show_hook => 'show_timing_clock_source',
+},
+
 { name => 'trace_connection_negotiation', type => 'bool', context => 'PGC_POSTMASTER', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Logs details of pre-authentication connection handshake.',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 741fce8dede..241dcb2810d 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -91,6 +91,7 @@
 #include "tcop/tcopprot.h"
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
+#include "portability/instr_time.h"
 #include "utils/bytea.h"
 #include "utils/float.h"
 #include "utils/guc_hooks.h"
@@ -372,6 +373,15 @@ static const struct config_enum_entry huge_pages_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry timing_clock_source_options[] = {
+	{"auto", TIMING_CLOCK_SOURCE_AUTO, false},
+	{"system", TIMING_CLOCK_SOURCE_SYSTEM, false},
+#if defined(__x86_64__)
+	{"tsc", TIMING_CLOCK_SOURCE_TSC, false},
+#endif
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry huge_pages_status_options[] = {
 	{"off", HUGE_PAGES_OFF, false},
 	{"on", HUGE_PAGES_ON, false},
@@ -722,6 +732,7 @@ const char *const config_group_names[] =
 	[CONN_AUTH_TCP] = gettext_noop("Connections and Authentication / TCP Settings"),
 	[CONN_AUTH_AUTH] = gettext_noop("Connections and Authentication / Authentication"),
 	[CONN_AUTH_SSL] = gettext_noop("Connections and Authentication / SSL"),
+	[RESOURCES_TIME] = gettext_noop("Resource Usage / Time"),
 	[RESOURCES_MEM] = gettext_noop("Resource Usage / Memory"),
 	[RESOURCES_DISK] = gettext_noop("Resource Usage / Disk"),
 	[RESOURCES_KERNEL] = gettext_noop("Resource Usage / Kernel Resources"),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f938cc65a3a..c11d88348f0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -193,6 +193,10 @@
 #max_files_per_process = 1000           # min 64
                                         # (change requires restart)
 
+# - Time -
+
+#timing_clock_source = auto             # auto, system, tsc (if supported)
+
 # - Background Writer -
 
 #bgwriter_delay = 200ms                 # 10-10000ms between rounds
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
index 3a6f41b98e1..bbe5968d974 100644
--- a/src/common/instr_time.c
+++ b/src/common/instr_time.c
@@ -20,8 +20,8 @@
  * Stores what the number of ticks needs to be multiplied with to end up
  * with nanoseconds using integer math.
  *
- * On certain platforms (currently Windows) the ticks to nanoseconds conversion
- * requires floating point math because:
+ * In certain cases (TSC on x86-64, and QueryPerformanceCounter on Windows)
+ * the ticks to nanoseconds conversion requires floating point math because:
  *
  * sec = ticks / frequency_hz
  * ns  = ticks / frequency_hz * 1,000,000,000
@@ -38,7 +38,7 @@
  * power-of-two which allows for shifting instead of doing an integer
  * division.
  *
- * On all other platforms we are using clock_gettime(), which uses nanoseconds
+ * In all other cases we are using clock_gettime(), which uses nanoseconds
  * as ticks. Hence, we set the multiplier to the precision scalar so that the
  * division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
  */
@@ -47,12 +47,53 @@ int64		max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
 
 static void set_ticks_per_ns(void);
 
+int			timing_clock_source = TIMING_CLOCK_SOURCE_AUTO;
+
+#if defined(__x86_64__)
+/* Indicates if TSC instructions (RDTSC and RDTSCP) are usable. */
+extern bool has_usable_tsc;
+
+static void tsc_initialize(void);
+static bool tsc_use_by_default(void);
+static int64 ticks_per_ns_for_tsc(void);
+static bool set_tsc_frequency_khz(void);
+static bool is_rdtscp_available(void);
+#endif
+
 void
 pg_initialize_timing()
 {
+#if defined(__x86_64__)
+	tsc_initialize();
+#endif
+
 	set_ticks_per_ns();
 }
 
+bool
+pg_set_timing_clock_source(TimingClockSourceType source)
+{
+#if defined(__x86_64__)
+	switch (source)
+	{
+		case TIMING_CLOCK_SOURCE_AUTO:
+			use_tsc = has_usable_tsc && tsc_use_by_default();
+			break;
+		case TIMING_CLOCK_SOURCE_SYSTEM:
+			use_tsc = false;
+			break;
+		case TIMING_CLOCK_SOURCE_TSC:
+			if (!has_usable_tsc)	/* Tell caller TSC is not usable */
+				return false;
+			use_tsc = true;
+			break;
+	}
+#endif
+	set_ticks_per_ns();
+	timing_clock_source = source;
+	return true;
+}
+
 #ifndef WIN32
 
 static int64
@@ -84,6 +125,241 @@ ticks_per_ns_for_system()
 static void
 set_ticks_per_ns()
 {
+#if defined(__x86_64__)
+	if (use_tsc)
+		ticks_per_ns_scaled = ticks_per_ns_for_tsc();
+	else
+		ticks_per_ns_scaled = ticks_per_ns_for_system();
+#else
 	ticks_per_ns_scaled = ticks_per_ns_for_system();
+#endif
 	max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
 }
+
+/* GUC handling */
+
+#ifndef FRONTEND
+
+#include "utils/guc_hooks.h"
+
+bool
+check_timing_clock_source(int *newval, void **extra, GucSource source)
+{
+#if defined(__x86_64__)
+	pg_initialize_timing();
+
+	if (*newval == TIMING_CLOCK_SOURCE_TSC && !has_usable_tsc)
+	{
+		GUC_check_errdetail("TSC is not supported as fast clock source");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_timing_clock_source(int newval, void *extra)
+{
+	/*
+	 * Ignore the return code since the check hook already verified TSC is
+	 * usable if its explicitly requested
+	 */
+	pg_set_timing_clock_source(newval);
+}
+
+const char *
+show_timing_clock_source()
+{
+#if defined(__x86_64__)
+	TimingClockSourceType effective_source = pg_current_timing_clock_source();
+
+	switch (timing_clock_source)
+	{
+		case TIMING_CLOCK_SOURCE_AUTO:
+			if (effective_source == TIMING_CLOCK_SOURCE_TSC)
+				return "auto (tsc)";
+			else
+				return "auto (system)";
+		case TIMING_CLOCK_SOURCE_SYSTEM:
+			return "system";
+		case TIMING_CLOCK_SOURCE_TSC:
+			return "tsc";
+	}
+#else
+	switch (timing_clock_source)
+	{
+		case TIMING_CLOCK_SOURCE_AUTO:
+			return "auto (system)";
+		case TIMING_CLOCK_SOURCE_SYSTEM:
+			return "system";
+	}
+#endif
+
+	/* unreachable */
+	return "?";
+}
+
+#endif							/* !FRONTEND */
+
+/* TSC specific logic */
+
+#if defined(__x86_64__)
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+bool		has_usable_tsc = false;
+bool		use_tsc = false;
+
+static uint32 tsc_frequency_khz = 0;
+
+/*
+ * Decide whether we use the RDTSC/RDTSCP instructions at runtime, for Linux/x86-64,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ */
+static void
+tsc_initialize(void)
+{
+	/*
+	 * Compute baseline CPU peformance, determines speed at which the TSC
+	 * advances.
+	 */
+	if (!set_tsc_frequency_khz())
+		return;
+
+	has_usable_tsc = is_rdtscp_available();
+}
+
+/*
+ * Decides whether to use TSC clock source if the user did not specify it
+ * one way or the other, and it is available (checked separately).
+ *
+ * Currently only enabled by default on Linux, since Linux already does a
+ * significant amount of work to determine whether TSC is a viable clock
+ * source.
+ */
+static bool
+tsc_use_by_default()
+{
+#if defined(__linux__)
+	FILE	   *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+	char		buf[128];
+
+	if (!fp)
+		return false;
+
+	if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+	{
+		fclose(fp);
+		return true;
+	}
+
+	fclose(fp);
+#endif
+
+	return false;
+}
+
+static int64
+ticks_per_ns_for_tsc()
+{
+	return INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_frequency_khz;
+}
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d)	/* KVMKVMKVM */
+
+static bool
+set_tsc_frequency_khz()
+{
+	uint32		r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+	if (r[2] > 0)
+	{
+		if (r[0] == 0 || r[1] == 0)
+			return false;
+
+		tsc_frequency_khz = r[2] / 1000 * r[1] / r[0];
+		return true;
+	}
+
+	/* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+	if (r[0] > 0)
+	{
+		tsc_frequency_khz = r[0] * 1000;
+		return true;
+	}
+
+	/*
+	 * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+	 * to us in a guest VM
+	 *
+	 * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+	 * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+	 * function does not work as expected since it contains a check for
+	 * __get_cpuid_max, which has been observed to be lower than the special
+	 * Hypervisor leaf.
+	 */
+#if defined(HAVE__CPUIDEX)
+	__cpuidex((int32 *) r, 0x40000000, 0);
+	if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+	{
+		__cpuidex((int32 *) r, 0x40000010, 0);
+		if (r[0] > 0)
+		{
+			tsc_frequency_khz = r[0];
+			return true;
+		}
+	}
+#endif
+
+	return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+	uint32		r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+	if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+		return false;
+#elif defined(HAVE__CPUID)
+	__cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+	return (r[3] & (1 << 27)) != 0;
+}
+
+#endif							/* defined(__x86_64__) */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index ed6e52ef84e..e7cc251b3d2 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,10 @@
  *	  portable high-precision interval timing
  *
  * This file provides an abstraction layer to hide portability issues in
- * interval timing.  On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter().  These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On x86 we use the RDTSC/RDTSCP instruction directly in
+ * certain cases, or alternatively clock_gettime() on Unix-like systems and
+ * QueryPerformanceCounter() on Windows. These macros also give some breathing
+ * room to use other high-precision-timing APIs.
  *
  * The basic data type is instr_time, which all callers should treat as an
  * opaque typedef.  instr_time can store either an absolute time (of
@@ -17,10 +18,11 @@
  *
  * INSTR_TIME_SET_ZERO(t)			set t to zero (memset is acceptable too)
  *
- * INSTR_TIME_SET_CURRENT(t)		set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t)	set t to current time without waiting
+ * 									for instructions in out-of-order window
  *
- * INSTR_TIME_SET_CURRENT_LAZY(t)	set t to current time if t is zero,
- *									evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t)		set t to current time while waiting for
+ * 									instructions in OOO to retire
  *
  * INSTR_TIME_ADD(x, y)				x += y
  *
@@ -91,13 +93,54 @@ typedef struct instr_time
 extern PGDLLIMPORT int64 ticks_per_ns_scaled;
 extern PGDLLIMPORT int64 max_ticks_no_overflow;
 
+#if defined(__x86_64__)
+#include <x86intrin.h>
+
+/* Whether to actually use RDTSC/RDTSCP based on availability and GUC settings. */
+extern PGDLLIMPORT bool use_tsc;
+#endif
+
+typedef enum
+{
+	TIMING_CLOCK_SOURCE_AUTO,
+	TIMING_CLOCK_SOURCE_SYSTEM,
+	TIMING_CLOCK_SOURCE_TSC
+}			TimingClockSourceType;
+
+extern int	timing_clock_source;
+
+/*
+ * Returns the current timing clock source effectively in use, resolving
+ * TIMING_CLOCK_SOURCE_AUTO to either TIMING_CLOCK_SOURCE_SYSTEM or
+ * TIMING_CLOCK_SOURCE_TSC.
+ */
+static inline TimingClockSourceType pg_current_timing_clock_source(void)
+{
+#if defined(__x86_64__)
+	return use_tsc ? TIMING_CLOCK_SOURCE_TSC : TIMING_CLOCK_SOURCE_SYSTEM;
+#else
+	return TIMING_CLOCK_SOURCE_SYSTEM;
+#endif
+}
+
 /*
  * Initialize timing infrastructure
  *
- * This must be called at least once before using INSTR_TIME_SET_CURRENT* macros.
+ * This must be called at least once by frontend programs before using
+ * INSTR_TIME_SET_CURRENT* macros. Backend programs automatically initialize
+ * this through the GUC check hook.
  */
 extern void pg_initialize_timing(void);
 
+/*
+ * Sets the time source to be used. Mainly intended for frontend programs,
+ * the backend should set it via the timing_clock_source GUC instead.
+ *
+ * Returns false if the clock source could not be set, for example when TSC
+ * is not available despite being explicitly set.
+ */
+extern bool pg_set_timing_clock_source(TimingClockSourceType source);
+
 #ifndef WIN32
 
 /* On POSIX, use clock_gettime() for system clock source */
@@ -115,22 +158,25 @@ extern void pg_initialize_timing(void);
  * than CLOCK_MONOTONIC.  In particular, as of macOS 10.12, Apple provides
  * CLOCK_MONOTONIC_RAW which is both faster to read and higher resolution than
  * their version of CLOCK_MONOTONIC.
+ *
+ * Note this does not get used in case the TSC clock source logic is used,
+ * which directly calls architecture specific timing instructions (e.g. RDTSC).
  */
 #if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
-#define PG_INSTR_CLOCK	CLOCK_MONOTONIC_RAW
+#define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC_RAW
 #elif defined(CLOCK_MONOTONIC)
-#define PG_INSTR_CLOCK	CLOCK_MONOTONIC
+#define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC
 #else
-#define PG_INSTR_CLOCK	CLOCK_REALTIME
+#define PG_INSTR_SYSTEM_CLOCK	CLOCK_REALTIME
 #endif
 
 static inline instr_time
-pg_get_ticks(void)
+pg_get_ticks_system(void)
 {
 	instr_time	now;
 	struct timespec tmp;
 
-	clock_gettime(PG_INSTR_CLOCK, &tmp);
+	clock_gettime(PG_INSTR_SYSTEM_CLOCK, &tmp);
 	now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
 
 	return now;
@@ -141,7 +187,7 @@ pg_get_ticks(void)
 /* On Windows, use QueryPerformanceCounter() for system clock source */
 
 static inline instr_time
-pg_get_ticks(void)
+pg_get_ticks_system(void)
 {
 	instr_time	now;
 	LARGE_INTEGER tmp;
@@ -190,6 +236,39 @@ pg_ticks_to_ns(int64 ticks)
 	return ns;
 }
 
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__)
+	if (likely(use_tsc))
+	{
+		instr_time	now;
+
+		now.ticks = __rdtsc();
+		return now;
+	}
+#endif
+
+	return pg_get_ticks_system();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__)
+	if (likely(use_tsc))
+	{
+		instr_time	now;
+		uint32		unused;
+
+		now.ticks = __rdtscp(&unused);
+		return now;
+	}
+#endif
+
+	return pg_get_ticks_system();
+}
+
 /*
  * Common macros
  */
@@ -198,8 +277,8 @@ pg_ticks_to_ns(int64 ticks)
 
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
 
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
-	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+	((t) = pg_get_ticks_fast())
 
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_get_ticks())
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 9c90670d9b8..a396e746415 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -162,6 +162,9 @@ extern const char *show_timezone(void);
 extern bool check_timezone_abbreviations(char **newval, void **extra,
 										 GucSource source);
 extern void assign_timezone_abbreviations(const char *newval, void *extra);
+extern void assign_timing_clock_source(int newval, void *extra);
+extern bool check_timing_clock_source(int *newval, void **extra, GucSource source);
+extern const char *show_timing_clock_source(void);
 extern bool check_transaction_buffers(int *newval, void **extra, GucSource source);
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 71a80161961..63440b8e36c 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -60,6 +60,7 @@ enum config_group
 	CONN_AUTH_TCP,
 	CONN_AUTH_AUTH,
 	CONN_AUTH_SSL,
+	RESOURCES_TIME,
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
-- 
2.47.1

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Reply via email to