Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

David Geier Mon, 23 Feb 2026 07:25:18 -0800

Hi Lukas,

Thanks for taking care of incorporating the latest patch feedback.


On 13.02.2026 05:11, Lukas Fittl wrote:
> On Thu, Feb 12, 2026 at 4:41 PM Andres Freund <[email protected]> wrote:
>> On 2026-02-12 08:05:27 -0800, Lukas Fittl wrote:
> (1) changing the pg_ticks_to_ns logic to have an explicit
> "ticks_per_ns_scaled == 0" early check and return at the start, and
> setting ticks_per_ns_scaled to 0 when clock_gettime() gets used. This
> is similar to what David already suggested in an earlier email.
> (2) using uint64 for the ticks_per_ns_scaled/max_ticks_no_overflow
> variables - this appears to help GCC generate a bit shift reliably,
> instead of an idiv instruction.
> 
> That appears to eliminate the regression in my testing. Attached an
> updated v7, which also has some slightly improved commit messages.
> 
> Additional comparisons with the test case you had back at the start of
> this thread, with system clock source on my test VM:
> 
> master:
> 
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 1888.891 ms (best of 3)
> pg_test_timing / Average loop time including overhead: 23.53 ns
> 
> v6 (0002 + pg_test_timing prev/cur change):
> 
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 1897.095 ms (best of 3)
> pg_test_timing / Average loop time including overhead: 25.52 ns
> 
> v7 (0002):
> 
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 1897.148 ms (best of 3)
> Average loop time including overhead: 23.14 ns

Shouldn't that result be better than master because you optimized the
loop overhead in v7-0002? That's at least what I've measured, see test
results below.

> And when looking at the TSC time source with the full patch set on the same 
> VM:
> 
> v6:
> 
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 1477.672 ms (best of 3)
> pg_test_timing / Average loop time including overhead: 11.79 ns
> 
> v7:
> 
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 1476.326 ms (best of 3)
> pg_test_timing / Average loop time including overhead: 11.78 ns
> 
> Thanks,
> Lukas
> 
> [0]: https://godbolt.org/z/EvK1M66n5
> 
> --
> Lukas Fittl

The code wasn't compiling properly on Windows because __x86_64__ is not
defined in Visual C++. I've changed the code to use

  #if defined(__x86_64__) || defined(_M_X64)

I've also changed #include <x86intrin.h> to <immintrin.h>.


I've tested v8 of the patch (= v7 plus aforementioned changes) on
Windows. I'm reporting the best of 3 runs.

lotsarows test with parallelism disabled:

master: 2781 ms
v7:     2776 ms (timing_clock_source = 'system')
v7:     2091 ms (timing_clock_source = 'tsc')

pg_test_timing:

master: 27.04 ns
v7:     16.59 ns (QueryxPerformanceCounter)
v7:     13.69 ns (RDTSCP)
v7:      9.42 ns (RDTSC)


v8 of the patch is attached to this mail.

--
David Geier

From ba87233bd03213301be6a0612d2f6259cd05f03a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 12 Feb 2026 01:12:19 -0800
Subject: [PATCH v8 4/4] pg_test_timing: Also test RDTSC/RDTSCP timing and
 report time source

Author: David Geier <[email protected]>
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 src/bin/pg_test_timing/pg_test_timing.c | 83 ++++++++++++++++++++++---
 src/include/portability/instr_time.h    |  4 ++
 2 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index 9fd630a490a..e5ec8843567 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -30,7 +30,7 @@ static long long int largest_diff_count;
 
 
 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(unsigned int duration);
+static uint64 test_timing(unsigned int duration, TimingClockSourceType source, bool fast_timing);
 static void output(uint64 loop_count);
 
 int
@@ -46,10 +46,47 @@ main(int argc, char *argv[])
 	/* initialize timing infrastructure (required for INSTR_* calls) */
 	pg_initialize_timing();
 
-	loop_count = test_timing(test_duration);
-
+	/*
+	 * First, test default (non-fast) timing code. A clock source for that is
+	 * always available. Hence, we can unconditionally output the result.
+	 */
+	loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_SYSTEM, false);
 	output(loop_count);
 
+#if defined(__x86_64__) || defined(_M_X64)
+
+	/*
+	 * If on a supported architecture, test the RDTSC clock source. This clock
+	 * source is not always available. In that case the loop count will be 0
+	 * and we don't print.
+	 *
+	 * We first emit RDTSCP timings, which is slower, and gets used for higher
+	 * precision measurements when the TSC clock source is enabled. We emit
+	 * RDTSC second, which is used for faster timing measurements with lower
+	 * precision.
+	 */
+	printf("\n");
+	loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, false);
+	if (loop_count > 0)
+	{
+		output(loop_count);
+		printf("\n");
+
+		/* Now, emit fast timing measurements */
+		loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
+		output(loop_count);
+		printf("\n");
+
+		pg_set_timing_clock_source(TIMING_CLOCK_SOURCE_AUTO);
+		if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
+			printf(_("TSC clock source will be used by default, unless timing_clock_source is set to 'system'.\n"));
+		else
+			printf(_("TSC clock source will not be used by default, unless timing_clock_source is set to 'tsc'.\n"));
+	}
+	else
+		printf(_("TSC clock source is not usable. Likely unable to determine TSC frequency. are you running in an unsupported virtualized environment?.\n"));
+#endif
+
 	return 0;
 }
 
@@ -146,14 +183,14 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}
 
-	printf(ngettext("Testing timing overhead for %u second.\n",
-					"Testing timing overhead for %u seconds.\n",
+	printf(ngettext("Testing timing overhead for %u second.\n\n",
+					"Testing timing overhead for %u seconds.\n\n",
 					test_duration),
 		   test_duration);
 }
 
 static uint64
-test_timing(unsigned int duration)
+test_timing(unsigned int duration, TimingClockSourceType source, bool fast_timing)
 {
 	uint64		total_time;
 	int64		time_elapsed = 0;
@@ -162,6 +199,25 @@ test_timing(unsigned int duration)
 				end_time,
 				prev,
 				cur;
+	char	   *time_source = NULL;
+
+	if (!pg_set_timing_clock_source(source))
+		return 0;
+
+#if defined(__x86_64__) || defined(_M_X64)
+	if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
+		time_source = fast_timing ? "RDTSC" : "RDTSCP";
+	else
+		time_source = PG_INSTR_SYSTEM_CLOCK_NAME;
+#else
+	time_source = PG_INSTR_SYSTEM_CLOCK_NAME;
+#endif
+	if (fast_timing)
+		printf(_("Fast clock source: %s\n"), time_source);
+	else if (source == TIMING_CLOCK_SOURCE_SYSTEM)
+		printf(_("System clock source: %s\n"), time_source);
+	else
+		printf(_("Clock source: %s\n"), time_source);
 
 	/*
 	 * Pre-zero the statistics data structures.  They're already zero by
@@ -175,7 +231,10 @@ test_timing(unsigned int duration)
 
 	total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
 
-	INSTR_TIME_SET_CURRENT(start_time);
+	if (fast_timing)
+		INSTR_TIME_SET_CURRENT_FAST(start_time);
+	else
+		INSTR_TIME_SET_CURRENT(start_time);
 	cur = start_time;
 
 	while (time_elapsed < total_time)
@@ -184,7 +243,10 @@ test_timing(unsigned int duration)
 					bits;
 
 		prev = cur;
-		INSTR_TIME_SET_CURRENT(cur);
+		if (fast_timing)
+			INSTR_TIME_SET_CURRENT_FAST(cur);
+		else
+			INSTR_TIME_SET_CURRENT(cur);
 		diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
 
 		/* Did time go backwards? */
@@ -221,7 +283,10 @@ test_timing(unsigned int duration)
 		time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
 	}
 
-	INSTR_TIME_SET_CURRENT(end_time);
+	if (fast_timing)
+		INSTR_TIME_SET_CURRENT_FAST(end_time);
+	else
+		INSTR_TIME_SET_CURRENT(end_time);
 
 	INSTR_TIME_SUBTRACT(end_time, start_time);
 
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e7191c5d6cd..58d52b39602 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -166,10 +166,13 @@ extern bool pg_set_timing_clock_source(TimingClockSourceType source);
  */
 #if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
 #define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC_RAW
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"clock_gettime (CLOCK_MONOTONIC_RAW)"
 #elif defined(CLOCK_MONOTONIC)
 #define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"clock_gettime (CLOCK_MONOTONIC)"
 #else
 #define PG_INSTR_SYSTEM_CLOCK	CLOCK_REALTIME
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"clock_gettime (CLOCK_REALTIME)"
 #endif
 
 static inline instr_time
@@ -188,6 +191,7 @@ pg_get_ticks_system(void)
 
 /* On Windows, use QueryPerformanceCounter() for system clock source */
 
+#define PG_INSTR_SYSTEM_CLOCK_NAME	"QueryPerformanceCounter"
 static inline instr_time
 pg_get_ticks_system(void)
 {
-- 
2.53.0.windows.1

From 2cdcdce47e82fe97df4fe36fb4ffc02a718210e3 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 12 Feb 2026 01:09:48 -0800
Subject: [PATCH v8 3/4] Timing: Use Time-Stamp Counter (TSC) on x86-64 for
 faster measurements

This allows the direct use of the Time-Stamp Counter (TSC) value retrieved
from the CPU using RDTSC/RDTSC instructions, instead of APIs like
clock_gettime() on POSIX systems. This reduces the overhead of EXPLAIN with
ANALYZE and TIMING ON. Tests showed that runtime when instrumented can be
reduced by up to 10% for queries moving lots of rows through the plan.

To control use of the TSC, the new "timing_clock_source" GUC is introduced,
whose default ("auto") automatically uses the TSC when running on Linux/x86-64,
in case the system clocksource is reported as "tsc". The use of the system
APIs can be enforced by setting "system", or on x86-64 architectures the
use of TSC can be enforced by explicitly setting "tsc".

Note, that we further split the use of the TSC into the RDTSC CPU instruction
which does not wait for out-of-order execution (faster, less precise)
and the RDTSCP instruction, which waits for outstanding instructions to
retire. RDTSCP is deemed to have little benefit in the typical
InstrStartNode() / InstrStopNode() use case of EXPLAIN, and can be up to
twice as slow. To separate these use cases, the new macro
INSTR_TIME_SET_CURRENT_FAST() is introduced, which uses RDTSC.

The original macro INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed
to be used when precision is more important than performance. When the
system timing clock source is used both of these macros instead utilize
the system APIs (clock_gettime / QueryPerformanceCounter) like before.

Author: David Geier <[email protected]>
Author: Andres Freund <[email protected]>
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 doc/src/sgml/config.sgml                      |  55 ++++
 src/backend/executor/instrument.c             |  12 +-
 src/backend/postmaster/postmaster.c           |   3 -
 src/backend/utils/misc/guc_parameters.dat     |  11 +
 src/backend/utils/misc/guc_tables.c           |  11 +
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/common/instr_time.c                       | 293 +++++++++++++++++-
 src/include/portability/instr_time.h          | 109 ++++++-
 src/include/utils/guc_hooks.h                 |   3 +
 src/include/utils/guc_tables.h                |   1 +
 10 files changed, 475 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 20dbcaeb3ee..bd6787caa71 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2532,6 +2532,61 @@ include_dir 'conf.d'
      </variablelist>
     </sect2>
 
+    <sect2 id="runtime-config-resource-time">
+     <title>Timing</title>
+
+     <variablelist>
+     <varlistentry id="guc-timing-clock-source" xreflabel="timing_clock_source">
+      <term><varname>timing_clock_source</varname> (<type>enum</type>)
+      <indexterm>
+       <primary><varname>timing_clock_source</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Selects the method for making timing measurements using the OS or specialized CPU
+        instructions. Possible values are:
+         <itemizedlist>
+          <listitem>
+           <para>
+            <literal>auto</literal> (automatically chooses TSC clock source for Linux-based
+            x86-64 systems that utilize "tsc" as their system clock source, otherwise uses
+            the OS system clock)
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>system</literal> (measures timing using the OS system clock)
+           </para>
+          </listitem>
+          <listitem>
+           <para>
+            <literal>tsc</literal> (measures timing using the x86-64 Time-Stamp Counter (TSC)
+            by directly executing RDTSC/RDTSCP instructions, see below)
+           </para>
+          </listitem>
+         </itemizedlist>
+         The default is <literal>auto</literal>.
+        </para>
+        <para>
+          If enabled, the TSC clock source will use the RDTSC instruction for the x86-64
+          Time-Stamp Counter (TSC) to perform certain time measurements, for example during
+          EXPLAIN ANALYZE. The RDTSC instruction has less overhead than going through the OS
+          clock source, which for an EXPLAIN ANALYZE statement will show timing closer to the
+          actual runtime when timing is off. For timings that require higher precision the
+          RDTSCP instruction is used, which avoids inaccuracies due to CPU instruction re-ordering.
+          Use of RDTSC/RDTSC is not supported on Windows or on other architectures, and is not
+          advised on systems that utilize an emulated TSC.
+        </para>
+        <para>
+          To help decide which clock source to use on an x86-64 system you can run the
+          <application>pg_test_timing</application> utility to check TSC availability, and
+          perform timing measurements.
+        </para>
+      </listitem>
+     </varlistentry>
+     </variablelist>
+    </sect2>
 
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index edab92a0ebe..ebdad31ca3b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
 void
 InstrStartNode(Instrumentation *instr)
 {
-	if (instr->need_timer &&
-		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
-		elog(ERROR, "InstrStartNode called twice in a row");
+	if (instr->need_timer)
+	{
+		if (!INSTR_TIME_IS_ZERO(instr->starttime))
+			elog(ERROR, "InstrStartNode called twice in a row");
+		else
+			INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+	}
 
 	/* save buffer usage totals at node entry, if needed */
 	if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		if (INSTR_TIME_IS_ZERO(instr->starttime))
 			elog(ERROR, "InstrStopNode called without start");
 
-		INSTR_TIME_SET_CURRENT(endtime);
+		INSTR_TIME_SET_CURRENT_FAST(endtime);
 		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
 
 		INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 60bd06ed665..3fac46c402b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -588,9 +588,6 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitializeGUCOptions();
 
-	/* initialize timing infrastructure (required for INSTR_* calls) */
-	pg_initialize_timing();
-
 	opterr = 1;
 
 	/*
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 9507778415d..cfbe70802b6 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -2988,6 +2988,17 @@
   assign_hook => 'assign_timezone_abbreviations',
 },
 
+{ name => 'timing_clock_source', type => 'enum', context => 'PGC_USERSET', group => 'RESOURCES_TIME',
+  short_desc => 'Controls the clock source used for collecting timing measurements.',
+  long_desc => 'This enables the use of specialized clock sources, specifically the RDTSC clock source on x86-64 systems (if available), to support timing measurements with lower overhead during EXPLAIN and other instrumentation.',
+  variable => 'timing_clock_source',
+  boot_val => 'TIMING_CLOCK_SOURCE_AUTO',
+  options => 'timing_clock_source_options',
+  check_hook => 'check_timing_clock_source',
+  assign_hook => 'assign_timing_clock_source',
+  show_hook => 'show_timing_clock_source',
+},
+
 { name => 'trace_connection_negotiation', type => 'bool', context => 'PGC_POSTMASTER', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Logs details of pre-authentication connection handshake.',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 741fce8dede..f0bc2e34797 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -91,6 +91,7 @@
 #include "tcop/tcopprot.h"
 #include "tsearch/ts_cache.h"
 #include "utils/builtins.h"
+#include "portability/instr_time.h"
 #include "utils/bytea.h"
 #include "utils/float.h"
 #include "utils/guc_hooks.h"
@@ -372,6 +373,15 @@ static const struct config_enum_entry huge_pages_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry timing_clock_source_options[] = {
+	{"auto", TIMING_CLOCK_SOURCE_AUTO, false},
+	{"system", TIMING_CLOCK_SOURCE_SYSTEM, false},
+#if defined(__x86_64__) || defined(_M_X64)
+	{"tsc", TIMING_CLOCK_SOURCE_TSC, false},
+#endif
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry huge_pages_status_options[] = {
 	{"off", HUGE_PAGES_OFF, false},
 	{"on", HUGE_PAGES_ON, false},
@@ -722,6 +732,7 @@ const char *const config_group_names[] =
 	[CONN_AUTH_TCP] = gettext_noop("Connections and Authentication / TCP Settings"),
 	[CONN_AUTH_AUTH] = gettext_noop("Connections and Authentication / Authentication"),
 	[CONN_AUTH_SSL] = gettext_noop("Connections and Authentication / SSL"),
+	[RESOURCES_TIME] = gettext_noop("Resource Usage / Time"),
 	[RESOURCES_MEM] = gettext_noop("Resource Usage / Memory"),
 	[RESOURCES_DISK] = gettext_noop("Resource Usage / Disk"),
 	[RESOURCES_KERNEL] = gettext_noop("Resource Usage / Kernel Resources"),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index f938cc65a3a..c11d88348f0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -193,6 +193,10 @@
 #max_files_per_process = 1000           # min 64
                                         # (change requires restart)
 
+# - Time -
+
+#timing_clock_source = auto             # auto, system, tsc (if supported)
+
 # - Background Writer -
 
 #bgwriter_delay = 200ms                 # 10-10000ms between rounds
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
index 3e8ee2d6db2..860bb5a05ca 100644
--- a/src/common/instr_time.c
+++ b/src/common/instr_time.c
@@ -20,8 +20,8 @@
  * Stores what the number of ticks needs to be multiplied with to end up
  * with nanoseconds using integer math.
  *
- * On certain platforms (currently Windows) the ticks to nanoseconds conversion
- * requires floating point math because:
+ * In certain cases (TSC on x86-64, and QueryPerformanceCounter on Windows)
+ * the ticks to nanoseconds conversion requires floating point math because:
  *
  * sec = ticks / frequency_hz
  * ns  = ticks / frequency_hz * 1,000,000,000
@@ -39,7 +39,7 @@
  * division. We utilize unsigned integers even though ticks are stored as a
  * signed value because that encourages compilers to generate better assembly.
  *
- * On all other platforms we are using clock_gettime(), which uses nanoseconds
+ * In all other cases we are using clock_gettime(), which uses nanoseconds
  * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
  * to return the original value.
  */
@@ -48,16 +48,57 @@ uint64		max_ticks_no_overflow = 0;
 
 static void set_ticks_per_ns(void);
 
+int			timing_clock_source = TIMING_CLOCK_SOURCE_AUTO;
+
+#if defined(__x86_64__) || defined(_M_X64)
+/* Indicates if TSC instructions (RDTSC and RDTSCP) are usable. */
+extern bool has_usable_tsc;
+
+static void tsc_initialize(void);
+static bool tsc_use_by_default(void);
+static void set_ticks_per_ns_for_tsc(void);
+static bool set_tsc_frequency_khz(void);
+static bool is_rdtscp_available(void);
+#endif
+
 void
 pg_initialize_timing()
 {
+#if defined(__x86_64__) || defined(_M_X64)
+	tsc_initialize();
+#endif
+
+	set_ticks_per_ns();
+}
+
+bool
+pg_set_timing_clock_source(TimingClockSourceType source)
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	switch (source)
+	{
+		case TIMING_CLOCK_SOURCE_AUTO:
+			use_tsc = has_usable_tsc && tsc_use_by_default();
+			break;
+		case TIMING_CLOCK_SOURCE_SYSTEM:
+			use_tsc = false;
+			break;
+		case TIMING_CLOCK_SOURCE_TSC:
+			if (!has_usable_tsc)	/* Tell caller TSC is not usable */
+				return false;
+			use_tsc = true;
+			break;
+	}
+#endif
 	set_ticks_per_ns();
+	timing_clock_source = source;
+	return true;
 }
 
 #ifndef WIN32
 
 static void
-set_ticks_per_ns()
+set_ticks_per_ns_system()
 {
 	ticks_per_ns_scaled = 0;
 	max_ticks_no_overflow = 0;
@@ -76,10 +117,252 @@ GetTimerFrequency(void)
 }
 
 static void
-set_ticks_per_ns()
+set_ticks_per_ns_system()
 {
 	ticks_per_ns_scaled = INT64CONST(1000000000) * TICKS_TO_NS_PRECISION / GetTimerFrequency();
 	max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
 }
 
 #endif							/* WIN32 */
+
+static void
+set_ticks_per_ns()
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	if (use_tsc)
+		set_ticks_per_ns_for_tsc();
+	else
+		set_ticks_per_ns_system();
+#else
+	set_ticks_per_ns_system();
+#endif
+}
+
+/* GUC handling */
+
+#ifndef FRONTEND
+
+#include "utils/guc_hooks.h"
+
+bool
+check_timing_clock_source(int *newval, void **extra, GucSource source)
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	pg_initialize_timing();
+
+	if (*newval == TIMING_CLOCK_SOURCE_TSC && !has_usable_tsc)
+	{
+		GUC_check_errdetail("TSC is not supported as fast clock source");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_timing_clock_source(int newval, void *extra)
+{
+	/*
+	 * Ignore the return code since the check hook already verified TSC is
+	 * usable if its explicitly requested
+	 */
+	pg_set_timing_clock_source(newval);
+}
+
+const char *
+show_timing_clock_source()
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	TimingClockSourceType effective_source = pg_current_timing_clock_source();
+
+	switch (timing_clock_source)
+	{
+		case TIMING_CLOCK_SOURCE_AUTO:
+			if (effective_source == TIMING_CLOCK_SOURCE_TSC)
+				return "auto (tsc)";
+			else
+				return "auto (system)";
+		case TIMING_CLOCK_SOURCE_SYSTEM:
+			return "system";
+		case TIMING_CLOCK_SOURCE_TSC:
+			return "tsc";
+	}
+#else
+	switch (timing_clock_source)
+	{
+		case TIMING_CLOCK_SOURCE_AUTO:
+			return "auto (system)";
+		case TIMING_CLOCK_SOURCE_SYSTEM:
+			return "system";
+	}
+#endif
+
+	/* unreachable */
+	return "?";
+}
+
+#endif							/* !FRONTEND */
+
+/* TSC specific logic */
+
+#if defined(__x86_64__) || defined(_M_X64)
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+bool		has_usable_tsc = false;
+bool		use_tsc = false;
+
+static uint32 tsc_frequency_khz = 0;
+
+/*
+ * Decide whether we use the RDTSC/RDTSCP instructions at runtime, for Linux/x86-64,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ */
+static void
+tsc_initialize(void)
+{
+	/*
+	 * Compute baseline CPU peformance, determines speed at which the TSC
+	 * advances.
+	 */
+	if (!set_tsc_frequency_khz())
+		return;
+
+	has_usable_tsc = is_rdtscp_available();
+}
+
+/*
+ * Decides whether to use TSC clock source if the user did not specify it
+ * one way or the other, and it is available (checked separately).
+ *
+ * Currently only enabled by default on Linux, since Linux already does a
+ * significant amount of work to determine whether TSC is a viable clock
+ * source.
+ */
+static bool
+tsc_use_by_default()
+{
+#if defined(__linux__)
+	FILE	   *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+	char		buf[128];
+
+	if (!fp)
+		return false;
+
+	if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+	{
+		fclose(fp);
+		return true;
+	}
+
+	fclose(fp);
+#endif
+
+	return false;
+}
+
+static void
+set_ticks_per_ns_for_tsc()
+{
+	ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_frequency_khz;
+	max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+}
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d)	/* KVMKVMKVM */
+
+static bool
+set_tsc_frequency_khz()
+{
+	uint32		r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+	if (r[2] > 0)
+	{
+		if (r[0] == 0 || r[1] == 0)
+			return false;
+
+		tsc_frequency_khz = r[2] / 1000 * r[1] / r[0];
+		return true;
+	}
+
+	/* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+	__get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+	__cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+	if (r[0] > 0)
+	{
+		tsc_frequency_khz = r[0] * 1000;
+		return true;
+	}
+
+	/*
+	 * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+	 * to us in a guest VM
+	 *
+	 * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+	 * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+	 * function does not work as expected since it contains a check for
+	 * __get_cpuid_max, which has been observed to be lower than the special
+	 * Hypervisor leaf.
+	 */
+#if defined(HAVE__CPUIDEX)
+	__cpuidex((int32 *) r, 0x40000000, 0);
+	if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+	{
+		__cpuidex((int32 *) r, 0x40000010, 0);
+		if (r[0] > 0)
+		{
+			tsc_frequency_khz = r[0];
+			return true;
+		}
+	}
+#endif
+
+	return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+	uint32		r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+	if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+		return false;
+#elif defined(HAVE__CPUID)
+	__cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+	return (r[3] & (1 << 27)) != 0;
+}
+
+#endif							/* defined(__x86_64__) || defined(_M_X64) */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 985b6b5af88..e7191c5d6cd 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,10 @@
  *	  portable high-precision interval timing
  *
  * This file provides an abstraction layer to hide portability issues in
- * interval timing.  On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter().  These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On x86 we use the RDTSC/RDTSCP instruction directly in
+ * certain cases, or alternatively clock_gettime() on Unix-like systems and
+ * QueryPerformanceCounter() on Windows. These macros also give some breathing
+ * room to use other high-precision-timing APIs.
  *
  * The basic data type is instr_time, which all callers should treat as an
  * opaque typedef.  instr_time can store either an absolute time (of
@@ -17,10 +18,11 @@
  *
  * INSTR_TIME_SET_ZERO(t)			set t to zero (memset is acceptable too)
  *
- * INSTR_TIME_SET_CURRENT(t)		set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t)	set t to current time without waiting
+ * 									for instructions in out-of-order window
  *
- * INSTR_TIME_SET_CURRENT_LAZY(t)	set t to current time if t is zero,
- *									evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t)		set t to current time while waiting for
+ * 									instructions in OOO to retire
  *
  * INSTR_TIME_ADD(x, y)				x += y
  *
@@ -93,13 +95,54 @@ typedef struct instr_time
 extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
 extern PGDLLIMPORT uint64 max_ticks_no_overflow;
 
+#if defined(__x86_64__) || defined(_M_X64)
+#include <immintrin.h>
+
+/* Whether to actually use RDTSC/RDTSCP based on availability and GUC settings. */
+extern PGDLLIMPORT bool use_tsc;
+#endif
+
+typedef enum
+{
+	TIMING_CLOCK_SOURCE_AUTO,
+	TIMING_CLOCK_SOURCE_SYSTEM,
+	TIMING_CLOCK_SOURCE_TSC
+}			TimingClockSourceType;
+
+extern int	timing_clock_source;
+
+/*
+ * Returns the current timing clock source effectively in use, resolving
+ * TIMING_CLOCK_SOURCE_AUTO to either TIMING_CLOCK_SOURCE_SYSTEM or
+ * TIMING_CLOCK_SOURCE_TSC.
+ */
+static inline TimingClockSourceType pg_current_timing_clock_source(void)
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	return use_tsc ? TIMING_CLOCK_SOURCE_TSC : TIMING_CLOCK_SOURCE_SYSTEM;
+#else
+	return TIMING_CLOCK_SOURCE_SYSTEM;
+#endif
+}
+
 /*
  * Initialize timing infrastructure
  *
- * This must be called at least once before using INSTR_TIME_SET_CURRENT* macros.
+ * This must be called at least once by frontend programs before using
+ * INSTR_TIME_SET_CURRENT* macros. Backend programs automatically initialize
+ * this through the GUC check hook.
  */
 extern void pg_initialize_timing(void);
 
+/*
+ * Sets the time source to be used. Mainly intended for frontend programs,
+ * the backend should set it via the timing_clock_source GUC instead.
+ *
+ * Returns false if the clock source could not be set, for example when TSC
+ * is not available despite being explicitly set.
+ */
+extern bool pg_set_timing_clock_source(TimingClockSourceType source);
+
 #ifndef WIN32
 
 /* On POSIX, use clock_gettime() for system clock source */
@@ -117,22 +160,25 @@ extern void pg_initialize_timing(void);
  * than CLOCK_MONOTONIC.  In particular, as of macOS 10.12, Apple provides
  * CLOCK_MONOTONIC_RAW which is both faster to read and higher resolution than
  * their version of CLOCK_MONOTONIC.
+ *
+ * Note this does not get used in case the TSC clock source logic is used,
+ * which directly calls architecture specific timing instructions (e.g. RDTSC).
  */
 #if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
-#define PG_INSTR_CLOCK	CLOCK_MONOTONIC_RAW
+#define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC_RAW
 #elif defined(CLOCK_MONOTONIC)
-#define PG_INSTR_CLOCK	CLOCK_MONOTONIC
+#define PG_INSTR_SYSTEM_CLOCK	CLOCK_MONOTONIC
 #else
-#define PG_INSTR_CLOCK	CLOCK_REALTIME
+#define PG_INSTR_SYSTEM_CLOCK	CLOCK_REALTIME
 #endif
 
 static inline instr_time
-pg_get_ticks(void)
+pg_get_ticks_system(void)
 {
 	instr_time	now;
 	struct timespec tmp;
 
-	clock_gettime(PG_INSTR_CLOCK, &tmp);
+	clock_gettime(PG_INSTR_SYSTEM_CLOCK, &tmp);
 	now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
 
 	return now;
@@ -143,7 +189,7 @@ pg_get_ticks(void)
 /* On Windows, use QueryPerformanceCounter() for system clock source */
 
 static inline instr_time
-pg_get_ticks(void)
+pg_get_ticks_system(void)
 {
 	instr_time	now;
 	LARGE_INTEGER tmp;
@@ -199,6 +245,39 @@ pg_ticks_to_ns(int64 ticks)
 #endif
 }
 
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	if (likely(use_tsc))
+	{
+		instr_time	now;
+
+		now.ticks = __rdtsc();
+		return now;
+	}
+#endif
+
+	return pg_get_ticks_system();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	if (likely(use_tsc))
+	{
+		instr_time	now;
+		uint32		unused;
+
+		now.ticks = __rdtscp(&unused);
+		return now;
+	}
+#endif
+
+	return pg_get_ticks_system();
+}
+
 /*
  * Common macros
  */
@@ -207,8 +286,8 @@ pg_ticks_to_ns(int64 ticks)
 
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
 
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
-	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+	((t) = pg_get_ticks_fast())
 
 #define INSTR_TIME_SET_CURRENT(t) \
 	((t) = pg_get_ticks())
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 9c90670d9b8..a396e746415 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -162,6 +162,9 @@ extern const char *show_timezone(void);
 extern bool check_timezone_abbreviations(char **newval, void **extra,
 										 GucSource source);
 extern void assign_timezone_abbreviations(const char *newval, void *extra);
+extern void assign_timing_clock_source(int newval, void *extra);
+extern bool check_timing_clock_source(int *newval, void **extra, GucSource source);
+extern const char *show_timing_clock_source(void);
 extern bool check_transaction_buffers(int *newval, void **extra, GucSource source);
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 71a80161961..63440b8e36c 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -60,6 +60,7 @@ enum config_group
 	CONN_AUTH_TCP,
 	CONN_AUTH_AUTH,
 	CONN_AUTH_SSL,
+	RESOURCES_TIME,
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
-- 
2.53.0.windows.1

From 2392d95626599a1b5562f9216eb8c334db99c932 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH v8 2/4] Timing: Streamline ticks to nanosecond conversion
 across platforms

The timing infrastructure (INSTR_* macros) measures time elapsed using
clock_gettime() on POSIX systems, which returns the time as nanoseconds,
and QueryPerformanceCounter() on Windows, which is a specialized timing
clock source that returns a tick counter that needs to be converted to
nanoseconds using the result of QueryPerformanceFrequency().

This conversion currently happens ad-hoc on Windows, e.g. when calling
INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
invocation, despite the frequency being stable after program start,
incurring unnecessary overhead. It also causes a fractured implementation
where macros are defined differently between platforms.

To ease code readability, and prepare for a future change that intends
to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
a new pg_ticks_to_ns() function that gets called on all platforms.

This function relies on a separately initialized ticks_per_ns_scaled
value, that represents the conversion ratio. This value is initialized
from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
POSIX systems, which results in the ticks being treated as nanoseconds.
Other architectures always directly return the original ticks.

To support this, pg_initialize_timing() is introduced, and is now
mandatory for both the backend and any frontend programs to call before
utilizing INSTR_* macros.

In passing modify pg_test_timing to reduce the per-loop overhead caused
by repeated divisions in INSTR_TIME_GET_NANOSEC when the ticks variable
has become very large. Instead diff first and then turn it into nanosecs.

Author: Lukas Fittl <[email protected]>
Author: Andres Freund <[email protected]>
Author: David Geier <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 src/backend/postmaster/postmaster.c     |   3 +
 src/bin/pg_test_timing/pg_test_timing.c |  18 ++--
 src/bin/pgbench/pgbench.c               |   3 +
 src/bin/psql/startup.c                  |   4 +
 src/common/Makefile                     |   1 +
 src/common/instr_time.c                 |  85 +++++++++++++++++++
 src/common/meson.build                  |   1 +
 src/include/portability/instr_time.h    | 106 +++++++++++++++++-------
 8 files changed, 181 insertions(+), 40 deletions(-)
 create mode 100644 src/common/instr_time.c

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3fac46c402b..60bd06ed665 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -588,6 +588,9 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	InitializeGUCOptions();
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	opterr = 1;
 
 	/*
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..9fd630a490a 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -43,6 +43,9 @@ main(int argc, char *argv[])
 
 	handle_args(argc, argv);
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	loop_count = test_timing(test_duration);
 
 	output(loop_count);
@@ -155,11 +158,10 @@ test_timing(unsigned int duration)
 	uint64		total_time;
 	int64		time_elapsed = 0;
 	uint64		loop_count = 0;
-	uint64		prev,
-				cur;
 	instr_time	start_time,
 				end_time,
-				temp;
+				prev,
+				cur;
 
 	/*
 	 * Pre-zero the statistics data structures.  They're already zero by
@@ -174,7 +176,7 @@ test_timing(unsigned int duration)
 	total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
 
 	INSTR_TIME_SET_CURRENT(start_time);
-	cur = INSTR_TIME_GET_NANOSEC(start_time);
+	cur = start_time;
 
 	while (time_elapsed < total_time)
 	{
@@ -182,9 +184,8 @@ test_timing(unsigned int duration)
 					bits;
 
 		prev = cur;
-		INSTR_TIME_SET_CURRENT(temp);
-		cur = INSTR_TIME_GET_NANOSEC(temp);
-		diff = cur - prev;
+		INSTR_TIME_SET_CURRENT(cur);
+		diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
 
 		/* Did time go backwards? */
 		if (unlikely(diff < 0))
@@ -217,8 +218,7 @@ test_timing(unsigned int duration)
 			largest_diff_count++;
 
 		loop_count++;
-		INSTR_TIME_SUBTRACT(temp, start_time);
-		time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+		time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
 	}
 
 	INSTR_TIME_SET_CURRENT(end_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index cb4e986092e..c8b233be16c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7334,6 +7334,9 @@ main(int argc, char **argv)
 		initRandomState(&state[i].cs_func_rs);
 	}
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 9a397ec87b7..69d044d405d 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
 #include "help.h"
 #include "input.h"
 #include "mainloop.h"
+#include "portability/instr_time.h"
 #include "settings.h"
 
 /*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
 
 	PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
 
+	/* initialize timing infrastructure (required for INSTR_* calls) */
+	pg_initialize_timing();
+
 	SyncVariables();
 
 	if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2c720caa509..1a2fbbe887f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
 	file_perm.o \
 	file_utils.o \
 	hashfn.o \
+	instr_time.o \
 	ip.o \
 	jsonapi.o \
 	keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..3e8ee2d6db2
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,85 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ *	   Non-inline parts of the portable high-precision interval timing
+ *	 implementation
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of ticks needs to be multiplied with to end up
+ * with nanoseconds using integer math.
+ *
+ * On certain platforms (currently Windows) the ticks to nanoseconds conversion
+ * requires floating point math because:
+ *
+ * sec = ticks / frequency_hz
+ * ns  = ticks / frequency_hz * 1,000,000,000
+ * ns  = ticks * (1,000,000,000 / frequency_hz)
+ * ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU
+ * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of precision. We
+ * first scale the integer up and after the multiplication by the number
+ * of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision and is a
+ * power-of-two which allows for shifting instead of doing an integer
+ * division. We utilize unsigned integers even though ticks are stored as a
+ * signed value because that encourages compilers to generate better assembly.
+ *
+ * On all other platforms we are using clock_gettime(), which uses nanoseconds
+ * as ticks. Hence, we set the multiplier to zero, which causes pg_ticks_to_ns
+ * to return the original value.
+ */
+uint64		ticks_per_ns_scaled = 0;
+uint64		max_ticks_no_overflow = 0;
+
+static void set_ticks_per_ns(void);
+
+void
+pg_initialize_timing()
+{
+	set_ticks_per_ns();
+}
+
+#ifndef WIN32
+
+static void
+set_ticks_per_ns()
+{
+	ticks_per_ns_scaled = 0;
+	max_ticks_no_overflow = 0;
+}
+
+#else							/* WIN32 */
+
+/* GetTimerFrequency returns counts per second */
+static inline double
+GetTimerFrequency(void)
+{
+	LARGE_INTEGER f;
+
+	QueryPerformanceFrequency(&f);
+	return (double) f.QuadPart;
+}
+
+static void
+set_ticks_per_ns()
+{
+	ticks_per_ns_scaled = INT64CONST(1000000000) * TICKS_TO_NS_PRECISION / GetTimerFrequency();
+	max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+}
+
+#endif							/* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index b757618a9c9..042edb7473a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
   'file_perm.c',
   'file_utils.c',
   'hashfn.c',
+  'instr_time.c',
   'ip.c',
   'jsonapi.c',
   'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 490593d1825..985b6b5af88 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -26,6 +26,8 @@
  *
  * INSTR_TIME_SUBTRACT(x, y)		x -= y
  *
+ * INSTR_TIME_DIFF_NANOSEC(x, y)	x - y (in nanoseconds)
+ *
  * INSTR_TIME_ACCUM_DIFF(x, y, z)	x += (y - z)
  *
  * INSTR_TIME_GET_DOUBLE(t)			convert t to double (in seconds)
@@ -78,11 +80,29 @@ typedef struct instr_time
 #define NS_PER_MS	INT64CONST(1000000)
 #define NS_PER_US	INT64CONST(1000)
 
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
 
-#ifndef WIN32
+/*
+ * Variables used to translate ticks to nanoseconds, initialized by
+ * pg_initialize_timing.
+ */
+extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
+extern PGDLLIMPORT uint64 max_ticks_no_overflow;
 
+/*
+ * Initialize timing infrastructure
+ *
+ * This must be called at least once before using INSTR_TIME_SET_CURRENT* macros.
+ */
+extern void pg_initialize_timing(void);
 
-/* Use clock_gettime() */
+#ifndef WIN32
+
+/* On POSIX, use clock_gettime() for system clock source */
 
 #include <time.h>
 
@@ -106,9 +126,8 @@ typedef struct instr_time
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-/* helper for INSTR_TIME_SET_CURRENT */
 static inline instr_time
-pg_clock_gettime_ns(void)
+pg_get_ticks(void)
 {
 	instr_time	now;
 	struct timespec tmp;
@@ -119,21 +138,12 @@ pg_clock_gettime_ns(void)
 	return now;
 }
 
-#define INSTR_TIME_SET_CURRENT(t) \
-	((t) = pg_clock_gettime_ns())
-
-#define INSTR_TIME_GET_NANOSEC(t) \
-	((int64) (t).ticks)
-
-
 #else							/* WIN32 */
 
+/* On Windows, use QueryPerformanceCounter() for system clock source */
 
-/* Use QueryPerformanceCounter() */
-
-/* helper for INSTR_TIME_SET_CURRENT */
 static inline instr_time
-pg_query_performance_counter(void)
+pg_get_ticks(void)
 {
 	instr_time	now;
 	LARGE_INTEGER tmp;
@@ -144,23 +154,50 @@ pg_query_performance_counter(void)
 	return now;
 }
 
-static inline double
-GetTimerFrequency(void)
-{
-	LARGE_INTEGER f;
-
-	QueryPerformanceFrequency(&f);
-	return (double) f.QuadPart;
-}
-
-#define INSTR_TIME_SET_CURRENT(t) \
-	((t) = pg_query_performance_counter())
-
-#define INSTR_TIME_GET_NANOSEC(t) \
-	((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
-
 #endif							/* WIN32 */
 
+static inline int64
+pg_ticks_to_ns(int64 ticks)
+{
+#if defined(__x86_64__) || defined(_M_X64)
+	int64		ns = 0;
+
+	if (ticks_per_ns_scaled == 0)
+		return ticks;
+
+	/*
+	 * Would multiplication overflow? If so perform computation in two parts.
+	 * Check overflow without actually overflowing via: a * b > max <=> a >
+	 * max / b
+	 */
+	if (unlikely(ticks > (int64) max_ticks_no_overflow))
+	{
+		/*
+		 * Compute how often the maximum number of ticks fits completely into
+		 * the number of elapsed ticks and convert that number into
+		 * nanoseconds. Then multiply by the count to arrive at the final
+		 * value. In a 2nd step we adjust the number of elapsed ticks and
+		 * convert the remaining ticks.
+		 */
+		int64		count = ticks / max_ticks_no_overflow;
+		int64		max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+		ns = max_ns * count;
+
+		/*
+		 * Subtract the ticks that we now already accounted for, so that they
+		 * don't get counted twice.
+		 */
+		ticks -= count * max_ticks_no_overflow;
+		Assert(ticks >= 0);
+	}
+
+	ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+	return ns;
+#else
+	return ticks;
+#endif
+}
 
 /*
  * Common macros
@@ -168,12 +205,13 @@ GetTimerFrequency(void)
 
 #define INSTR_TIME_IS_ZERO(t)	((t).ticks == 0)
 
-
 #define INSTR_TIME_SET_ZERO(t)	((t).ticks = 0)
 
 #define INSTR_TIME_SET_CURRENT_LAZY(t) \
 	(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
 
+#define INSTR_TIME_SET_CURRENT(t) \
+	((t) = pg_get_ticks())
 
 #define INSTR_TIME_ADD(x,y) \
 	((x).ticks += (y).ticks)
@@ -181,12 +219,18 @@ GetTimerFrequency(void)
 #define INSTR_TIME_SUBTRACT(x,y) \
 	((x).ticks -= (y).ticks)
 
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+	(pg_ticks_to_ns((x).ticks - (y).ticks))
+
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	((x).ticks += (y).ticks - (z).ticks)
 
 #define INSTR_TIME_LT(x,y) \
 	((x).ticks > (y).ticks)
 
+#define INSTR_TIME_GET_NANOSEC(t) \
+	(pg_ticks_to_ns((t).ticks))
+
 #define INSTR_TIME_GET_DOUBLE(t) \
 	((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
 
-- 
2.53.0.windows.1

From 25b58d2890e65a95ce426a0b80fab41c1c99bd8f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 31 Jan 2026 08:49:46 -0800
Subject: [PATCH v8 1/4] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
 separately

Previously we would only check for the availability of __cpuidex if
the related __get_cpuid_count was not available on a platform. But there
are cases where we want to be able to call __cpuidex as the only viable
option, specifically, when accessing a high leaf like VM Hypervisor
information (0x40000000), which __get_cpuid_count does not allow.

This will be used in an future commit to access Hypervisor information
about the TSC frequency of x86 CPUs, where available.

Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
adjust existing code to check for _MSC_VER when including intrin.h.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
 configure                         | 20 ++++++++++++--------
 configure.ac                      | 30 +++++++++++++++++-------------
 meson.build                       | 10 ++++++++--
 src/port/pg_crc32c_sse42_choose.c |  4 ++--
 src/port/pg_popcount_x86.c        |  4 ++--
 5 files changed, 41 insertions(+), 27 deletions(-)

diff --git a/configure b/configure
index 59894aaa06e..db387e95122 100755
--- a/configure
+++ b/configure
@@ -17709,7 +17709,8 @@ $as_echo "#define HAVE__CPUID 1" >>confdefs.h
   fi
 fi
 
-# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
+# need __cpuidex() even if __get_cpuid_count() is available.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __get_cpuid_count" >&5
 $as_echo_n "checking for __get_cpuid_count... " >&6; }
 if ${pgac_cv__get_cpuid_count+:} false; then :
@@ -17742,21 +17743,25 @@ if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
 
 $as_echo "#define HAVE__GET_CPUID_COUNT 1" >>confdefs.h
 
-else
-  # __cpuidex()
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
+fi
+# __cpuidex()
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __cpuidex" >&5
 $as_echo_n "checking for __cpuidex... " >&6; }
 if ${pgac_cv__cpuidex+:} false; then :
   $as_echo_n "(cached) " >&6
 else
   cat confdefs.h - <<_ACEOF >conftest.$ac_ext
 /* end confdefs.h.  */
-#include <intrin.h>
+#ifdef _MSC_VER
+    #include <intrin.h>
+    #else
+    #include <cpuid.h>
+    #endif
 int
 main ()
 {
 unsigned int exx[4] = {0, 0, 0, 0};
-    __cpuidex(exx, 7, 0);
+  __cpuidex(exx, 7, 0);
 
   ;
   return 0;
@@ -17772,11 +17777,10 @@ rm -f core conftest.err conftest.$ac_objext \
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__cpuidex" >&5
 $as_echo "$pgac_cv__cpuidex" >&6; }
-  if test x"$pgac_cv__cpuidex" = x"yes"; then
+if test x"$pgac_cv__cpuidex" = x"yes"; then
 
 $as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h
 
-  fi
 fi
 
 # Check for XSAVE intrinsics
diff --git a/configure.ac b/configure.ac
index 24fad757eb5..dcdcb5d1bce 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2111,7 +2111,8 @@ else
   fi
 fi
 
-# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
+# need __cpuidex() even if __get_cpuid_count() is available.
 AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
 [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <cpuid.h>],
   [[unsigned int exx[4] = {0, 0, 0, 0};
@@ -2121,18 +2122,21 @@ AC_CACHE_CHECK([for __get_cpuid_count], [pgac_cv__get_cpuid_count],
   [pgac_cv__get_cpuid_count="no"])])
 if test x"$pgac_cv__get_cpuid_count" = x"yes"; then
   AC_DEFINE(HAVE__GET_CPUID_COUNT, 1, [Define to 1 if you have __get_cpuid_count.])
-else
-  # __cpuidex()
-  AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
-  [AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <intrin.h>],
-    [[unsigned int exx[4] = {0, 0, 0, 0};
-    __cpuidex(exx, 7, 0);
-    ]])],
-    [pgac_cv__cpuidex="yes"],
-    [pgac_cv__cpuidex="no"])])
-  if test x"$pgac_cv__cpuidex" = x"yes"; then
-    AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
-  fi
+fi
+# __cpuidex()
+AC_CACHE_CHECK([for __cpuidex], [pgac_cv__cpuidex],
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([#ifdef _MSC_VER
+    #include <intrin.h>
+    #else
+    #include <cpuid.h>
+    #endif],
+  [[unsigned int exx[4] = {0, 0, 0, 0};
+  __cpuidex(exx, 7, 0);
+  ]])],
+  [pgac_cv__cpuidex="yes"],
+  [pgac_cv__cpuidex="no"])])
+if test x"$pgac_cv__cpuidex" = x"yes"; then
+  AC_DEFINE(HAVE__CPUIDEX, 1, [Define to 1 if you have __cpuidex.])
 fi
 
 # Check for XSAVE intrinsics
diff --git a/meson.build b/meson.build
index ebfb85e93e5..312c919eaa4 100644
--- a/meson.build
+++ b/meson.build
@@ -2080,7 +2080,8 @@ elif cc.links('''
 endif
 
 
-# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
+# Check for __get_cpuid_count() and __cpuidex() separately, since we sometimes
+# need __cpuidex() even if __get_cpuid_count() is available.
 if cc.links('''
     #include <cpuid.h>
     int main(int arg, char **argv)
@@ -2091,8 +2092,13 @@ if cc.links('''
     ''', name: '__get_cpuid_count',
     args: test_c_args)
   cdata.set('HAVE__GET_CPUID_COUNT', 1)
-elif cc.links('''
+endif
+if cc.links('''
+    #ifdef _MSC_VER
     #include <intrin.h>
+    #else
+    #include <cpuid.h>
+    #endif
     int main(int arg, char **argv)
     {
         unsigned int exx[4] = {0, 0, 0, 0};
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index f586476964f..7a75380b483 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,11 +20,11 @@
 
 #include "c.h"
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
 #include <cpuid.h>
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
 #include <intrin.h>
 #endif
 
diff --git a/src/port/pg_popcount_x86.c b/src/port/pg_popcount_x86.c
index 6bce089432f..840910f3337 100644
--- a/src/port/pg_popcount_x86.c
+++ b/src/port/pg_popcount_x86.c
@@ -14,7 +14,7 @@
 
 #ifdef HAVE_X86_64_POPCNTQ
 
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
 #include <cpuid.h>
 #endif
 
@@ -22,7 +22,7 @@
 #include <immintrin.h>
 #endif
 
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
 #include <intrin.h>
 #endif
 
-- 
2.53.0.windows.1

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Reply via email to