Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Andres Freund Wed, 01 Apr 2026 17:55:18 -0700

Hi,

Lukas, could you add some preliminary Reviewed-by tags to the
not-yet-committed commitse?



On 2026-03-25 18:17:44 -0700, Lukas Fittl wrote:
> From 7cc7e230e05947892b2aa479eba45144beede48e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 31 Jan 2026 08:49:46 -0800
> Subject: [PATCH v14 1/6] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
>  separately

John, do you want to take this one, or should I take it?


> Subject: [PATCH v14 2/6] pg_test_timing: Reduce per-loop overhead

Pushed, after simplifying the following

> +     INSTR_TIME_ADD_NANOSEC(end_time, duration > 0 ? duration * NS_PER_S : 
> 0);

by removing that ?:, since duration is unsigned and the caller checks for 0
and negative values. And even if a negative value or 0 were to be passed, it'd
still be harmless.



> From 8fbde9831fd89615a41af56917aed9faf619dbbb Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Fri, 25 Jul 2025 17:57:20 -0700
> Subject: [PATCH v14 3/6] instrumentation: Streamline ticks to nanosecond
>  conversion across platforms
> 
> The timing infrastructure (INSTR_* macros) measures time elapsed using
> clock_gettime() on POSIX systems, which returns the time as nanoseconds,
> and QueryPerformanceCounter() on Windows, which is a specialized timing
> clock source that returns a tick counter that needs to be converted to
> nanoseconds using the result of QueryPerformanceFrequency().
> 
> This conversion currently happens ad-hoc on Windows, e.g. when calling
> INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
> invocation, despite the frequency being stable after program start,
> incurring unnecessary overhead. It also causes a fractured implementation
> where macros are defined differently between platforms.
> 
> To ease code readability, and prepare for a future change that intends
> to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
> a new pg_ticks_to_ns() function that gets called on all platforms.
> 
> This function relies on a separately initialized ticks_per_ns_scaled
> value, that represents the conversion ratio. This value is initialized
> from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
> POSIX systems, which results in the ticks being treated as nanoseconds.
> Other architectures always directly return the original ticks.

> To support this, pg_initialize_timing() is introduced, and is now
> mandatory for both the backend and any frontend programs to call before
> utilizing INSTR_* macros.

Are there assertions detecting missing initializations? Ah, yes, there is.


> +/*
> + * Stores what the number of ticks needs to be multiplied with to end up
> + * with nanoseconds using integer math.
> + *
> + * On certain platforms (currently Windows) the ticks to nanoseconds 
> conversion
> + * requires floating point math because:
> + *
> + * sec = ticks / frequency_hz
> + * ns  = ticks / frequency_hz * 1,000,000,000
> + * ns  = ticks * (1,000,000,000 / frequency_hz)
> + * ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> + *
> + * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU
> + * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
> + *
> + * To be able to use integer math we work around the lack of precision. We
> + * first scale the integer up (left shift by TICKS_TO_NS_SHIFT) and after the
> + * multiplication by the number of ticks in pg_ticks_to_ns() we shift right 
> by
> + * the same amount. We utilize unsigned integers even though ticks are stored
> + * as a signed value to encourage compilers to generate better assembly.
> + *
> + * We remember the maximum number of ticks that can be multiplied by the 
> scale
> + * factor without overflowing so we can check via a * b > max <=> a > max / 
> b.
> + *
> + * On all other platforms we are using clock_gettime(), which uses 
> nanoseconds
> + * as ticks. Hence, we set the multiplier to zero, which causes 
> pg_ticks_to_ns
> + * to return the original value.
> + */
> +uint64               ticks_per_ns_scaled = 0;
> +uint64               max_ticks_no_overflow = 0;
> +bool         timing_initialized = false;

Think it may be worth adding an example for a realistic max_ticks_no_overflow
to show that we almost always are going to be able to avoid the overflow case.


> +static void set_ticks_per_ns(void);
> +
> +void
> +pg_initialize_timing(void)
> +{
> +     if (timing_initialized)
> +             return;
> +
> +     set_ticks_per_ns();
> +     timing_initialized = true;
> +}
> +
> +#ifndef WIN32
> +
> +static void
> +set_ticks_per_ns(void)
> +{
> +     ticks_per_ns_scaled = 0;
> +     max_ticks_no_overflow = 0;
> +}
> +
> +#else                                                        /* WIN32 */
> +
> +/* GetTimerFrequency returns counts per second */
> +static inline double
> +GetTimerFrequency(void)
> +{
> +     LARGE_INTEGER f;
> +
> +     QueryPerformanceFrequency(&f);
> +     return (double) f.QuadPart;
> +}
> +
> +static void
> +set_ticks_per_ns(void)
> +{
> +     ticks_per_ns_scaled = (NS_PER_S << TICKS_TO_NS_SHIFT) / 
> GetTimerFrequency();
> +     max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
> +}
> +
> +#endif                                                       /* WIN32 */

Seems a tiny bit odd to introduce them here with a name to then evolve the
naming pattern in the subsequent patches.

> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
>  {
> -     LARGE_INTEGER f;
> +#if PG_INSTR_TICKS_TO_NS
> +     int64           ns = 0;
> +
> +     Assert(timing_initialized);
> +
> +     /*
> +      * Avoid doing work if we don't use scaled ticks, e.g. system clock on
> +      * Unix
> +      */
> +     if (ticks_per_ns_scaled == 0)
> +             return ticks;
> +
> +     /*
> +      * Would multiplication overflow? If so perform computation in two 
> parts.
> +      */
> +     if (unlikely(ticks > (int64) max_ticks_no_overflow))
> +     {
> +             /*
> +              * To avoid overflow, first scale total ticks down by the fixed
> +              * factor, and *afterwards* multiply them by the 
> frequency-based scale
> +              * factor.
> +              *
> +              * The remaining ticks can follow the regular formula, since 
> they
> +              * won't overflow.
> +              */
> +             int64           count = ticks >> TICKS_TO_NS_SHIFT;
> +
> +             ns = count * ticks_per_ns_scaled;
> +             ticks -= (count << TICKS_TO_NS_SHIFT);
> +     }
> +
> +     ns += (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT;
> +
> +     return ns;
> +#else
> +     Assert(timing_initialized);
>  
> -     QueryPerformanceFrequency(&f);
> -     return (double) f.QuadPart;
> +     return ticks;
> +#endif                                                       /* 
> PG_INSTR_TICKS_TO_NS */

Kinda wondering what the best way would be to verify that the overflow case
actually works.  What about adding a regress.c test that checks that we get
the right results for something like

   INSTR_TIME_SET_ZERO(t);
   INSTR_TIME_ADD_NANOSEC(t, some_number);
   EXPECT_EQ_U64(INSTR_TIME_GET_NANOSEC(t), some_number);

for a few different some_numbers?



> Subject: [PATCH v14 4/6] instrumentation: Use Time-Stamp Counter (TSC) on
>  x86-64 for faster measurements

> +bool
> +check_timing_clock_source(int *newval, void **extra, GucSource source)
> +{
> +     /*
> +      * Ensure timing is initialized. On Windows (EXEC_BACKEND), GUC hooks 
> can
> +      * be called during InitializeGUCOptions() before InitProcessGlobals() 
> has
> +      * had a chance to run pg_initialize_timing().
> +      */
> +     pg_initialize_timing();

So doesn't that mean the effort to synchronize the tsc freq on EXEC_BACKEND is 
futile?


> diff --git a/src/bin/pg_test_timing/pg_test_timing.c 
> b/src/bin/pg_test_timing/pg_test_timing.c
> index 1d9ee4fb588..ee0e3a3b0ab 100644
> --- a/src/bin/pg_test_timing/pg_test_timing.c
> +++ b/src/bin/pg_test_timing/pg_test_timing.c
> @@ -43,7 +43,9 @@ main(int argc, char *argv[])
>  
>       handle_args(argc, argv);
>  
> -     /* initialize timing infrastructure (required for INSTR_* calls) */
> +     /*
> +      * Initialize timing infrastructure (required for INSTR_* calls)
> +      */
>       pg_initialize_timing();
>  
>       loop_count = test_timing(test_duration);

This seems like a superfluous change?


> +bool
> +pg_set_timing_clock_source(TimingClockSourceType source)
> +{
> +     Assert(timing_initialized);
> +
> +#if PG_INSTR_TSC_CLOCK
> +     pg_initialize_timing_tsc();
> +#endif
> +
> +#if PG_INSTR_TSC_CLOCK
> +     switch (source)
> +     {
> +             case TIMING_CLOCK_SOURCE_AUTO:
> +                     use_tsc = (tsc_frequency_khz > 0) && 
> tsc_use_by_default();

A global named use_tsc seems a tad too short a name for my personal taste. Too
likely to also be used by something else.


> +int32                tsc_frequency_khz = -1;
> +
> +static uint32 tsc_calibrate(void);
> +
> +/*
> + * Detect the TSC frequency and whether RDTSCP is available on x86-64.
> + *
> + * This can't be reliably determined at compile time, since the
> + * availability of an "invariant" TSC (that is not affected by CPU
> + * frequency changes) is dependent on the CPU architecture. Additionally,
> + * there are cases where TSC availability is impacted by virtualization,
> + * where a simple cpuid feature check would not be enough.
> + */
> +static void
> +tsc_detect_frequency(void)
> +{
> +     tsc_frequency_khz = 0;

So we're using -1 for uninitialized and 0 for unknown? If so we should
document that.


> +     /* We require RDTSCP support, bail if not available */
> +     if (!x86_feature_available(PG_RDTSCP))
> +             return;
> +
> +     /* Determine speed at which the TSC advances */
> +     tsc_frequency_khz = x86_tsc_frequency_khz();
> +     if (tsc_frequency_khz > 0)
> +             return;
> +
> +     /*
> +      * CPUID did not give us the TSC frequency. If TSC is invariant and 
> RDTSCP
> +      * is available, we can measure the frequency by comparing TSC ticks
> +      * against walltime using a short calibration loop.
> +      */
> +     if (x86_feature_available(PG_TSC_INVARIANT))
> +             tsc_frequency_khz = tsc_calibrate();
> +}

Why do we not need to always check if the TSC is invariant? Seems like a hard
requirement to me?



> +/*
> + * Calibrate the TSC frequency by comparing TSC ticks against walltime.
> + *
> + * Takes initial TSC and system clock snapshots, then loops, recomputing the
> + * frequency each TSC_CALIBRATION_SKIPS iterations from cumulative TSC
> + * ticks divided by elapsed time.
> + *
> + * Once the frequency estimate stabilizes (consecutive iterations agree), we
> + * consider it converged and the frequency in KHz is returned. If either too
> + * many iterations or a time limit passes without convergence, 0 is returned.
> + */
> +#define TSC_CALIBRATION_MAX_NS               (50 * NS_PER_MS)
> +#define TSC_CALIBRATION_ITERATIONS   1000000
> +#define TSC_CALIBRATION_SKIPS                100
> +#define TSC_CALIBRATION_STABLE_CYCLES        10
> +

> +static uint32
> +tsc_calibrate(void)
> +{
> +     instr_time      initial_wall;
> +     int64           initial_tsc;
> +     double          freq_khz = 0;
> +     double          prev_freq_khz = 0;
> +     int                     stable_count = 0;
> +     int64           prev_tsc;
> +     uint32          unused;
> +
> +     /* Ensure INSTR_* time below work on system time */
> +     set_ticks_per_ns_system();
> +
> +     INSTR_TIME_SET_CURRENT(initial_wall);
> +
> +#ifdef _MSC_VER
> +     initial_tsc = __rdtscp(&unused);
> +#else
> +     initial_tsc = __builtin_ia32_rdtscp(&unused);
> +#endif
> +     prev_tsc = initial_tsc;
> +
> +     for (int i = 0; i < TSC_CALIBRATION_ITERATIONS; i++)
> +     {
> +             instr_time      now_wall;
> +             int64           now_tsc;
> +             int64           elapsed_ns;
> +             int64           elapsed_ticks;
> +
> +             INSTR_TIME_SET_CURRENT(now_wall);
> +
> +#ifdef _MSC_VER
> +             now_tsc = __rdtscp(&unused);
> +#else
> +             now_tsc = __builtin_ia32_rdtscp(&unused);
> +#endif

I'd put this in a helper, given that we already need it twice, and we might
later need to extend it for other architectures.


> +             /*
> +              * Skip if this is not the Nth cycle where we measure, if TSC 
> hasn't
> +              * advanced, or we walked backwards for some reason.
> +              */
> +             if (i % TSC_CALIBRATION_SKIPS != 0 || now_tsc == prev_tsc || 
> elapsed_ns <= 0 || elapsed_ticks <= 0)
> +                     continue;

What's the goal of TSC_CALIBRATION_SKIPS?



> +             freq_khz = ((double) elapsed_ticks / elapsed_ns) * 1000 * 1000;
> +
> +             /*
> +              * Once freq_khz / prev_freq_khz is small, check if it stays 
> that way.
> +              * If it does for long enough, we've got a winner frequency.
> +              */
> +             if (prev_freq_khz != 0 && fabs(1 - freq_khz / prev_freq_khz) < 
> 0.0001)
> +             {
> +                     stable_count++;
> +                     if (stable_count >= TSC_CALIBRATION_STABLE_CYCLES)
> +                             return (uint32) freq_khz;
> +             }
> +             else
> +                     stable_count = 0;
> +
> +             prev_tsc = now_tsc;
> +             prev_freq_khz = freq_khz;
> +     }
> +
> +     /* did not converge */
> +     return 0;
> +}

I think it may be worth teaching pg_test_timing to display the tsc frequency
and show the frequency determined by calibration even if we got it via
cpuid. Otherwise it'll be hard to debug this.


> +/*
> + * Initialize the TSC clock source by determining its usability and 
> frequency.
> + *
> + * This can be called multiple times, as tsc_frequency_khz will be set to 0
> + * if a prior call determined the TSC is not usable. On EXEC_BACKEND 
> (Windows),
> + * the TSC frequency may also be set by restore_backend_variables.
> + */
> +void
> +pg_initialize_timing_tsc(void)
> +{
> +     if (tsc_frequency_khz < 0)
> +             tsc_detect_frequency();
> +}

Maybe we should add a debug log with the frequency in the < 0 case?

> +uint32
> +x86_tsc_frequency_khz(void)
> +{
> +     unsigned int r[4] = {0};
> +
> +     if (!x86_feature_available(PG_TSC_INVARIANT))
> +             return 0;
> +
> +     if (x86_feature_available(PG_HYPERVISOR))
> +             return x86_hypervisor_tsc_frequency_khz();
> +
> +     /*
> +      * On modern Intel CPUs, the TSC is implemented by invariant timekeeping
> +      * hardware, also called "Always Running Timer", or ART. The ART stays
> +      * consistent even if the CPU changes frequency due to changing power
> +      * levels.
> +      *
> +      * As documented in "Determining the Processor Base Frequency" in the
> +      * "Intel® 64 and IA-32 Architectures Software Developer's Manual",
> +      * February 2026 Edition, we can get the TSC frequency as follows:
> +      *
> +      * Nominal TSC frequency = ( CPUID.15H:ECX[31:0] * CPUID.15H:EBX[31:0] 
> ) /
> +      * CPUID.15H:EAX[31:0]
> +      *
> +      * With CPUID.15H:ECX representing the nominal core crystal clock
> +      * frequency, and EAX/EBX representing values used to translate the TSC
> +      * value to that frequency, see "Chapter 20.17 "Time-Stamp Counter" of
> +      * that manual.
> +      *
> +      * Older Intel CPUs, and other vendors do not set CPUID.15H:ECX, and as
> +      * such we fall back to alternate approaches.
> +      */
> +     pg_cpuid(0x15, r);
> +     if (r[ECX] > 0)
> +     {
> +             /*
> +              * EBX not being set indicates invariant TSC is not available. 
> Require
> +              * EAX being non-zero too, to avoid a theoretical divide by 
> zero.
> +              */
> +             if (r[EAX] == 0 || r[EBX] == 0)
> +                     return 0;
> +
> +             return r[ECX] / 1000 * r[EBX] / r[EAX];
> +     }
> +
> +     /*
> +      * When CPUID.15H is not available/incomplete, but we have verified an
> +      * invariant TSC is used, we can instead get the processor base 
> frequency
> +      * in MHz from CPUID.16H:EAX, the "Processor Frequency Information 
> Leaf".
> +      */

s/can instead/can possibly instead/

As it's also optional...



> From 44292c17075264c659e03d784a6c7619e4c5bd3e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 10 Mar 2026 01:38:14 -0700
> Subject: [PATCH v14 6/6] instrumentation: ARM support for fast time
>  measurements
> 
> Similar to the RDTSC/RDTSCP instructions on x68-64, this introduces
> use of the cntvct_el0 instruction on ARM systems to access the generic
> timer that provides a synchronized ticks value across CPUs.
> 
> Note this adds an exception for Apple Silicon CPUs, due to the observed
> fact that M3 and newer has different timer frequencies for the Efficiency
> and the Performance cores, and we can't be sure where we get scheduled.

There are other such SOCs with different clock speeds for
performance/efficiency cores, I'm a bit worried that just denylisting Apple
Silicon will end up allowing other such systems.


Greetings,

Andres Freund

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Reply via email to