Hi Andres, thanks for the detailed review! And thanks to David for testing on
Windows (and patching what's needed); it's great to have confirmation that this
works there too.
See attached v9, rebased, with feedback addressed unless otherwise noted:

- 0001 is the updated cpuidex patch, and should be good to go (though we
  could also merge this into the TSC commit, if you prefer)
- 0002 is the INSTR_TIME_SET_CURRENT_LAZY removal pulled forward
- 0003 is the INSTR_TIME_LT => INSTR_TIME_GT naming fix from [0], because I
  assume we'll push that shortly, and 0004 now uses that macro
- 0004 is the pg_test_timing change split out, with an additional
  optimization (see below)
- 0005 is the same as v8/0002 (Streamline ticks to nanosecond conversion),
  but needs some more brain energy to address your suggestions (or feel free
  to drive that forward if you'd like)
- 0006 and 0007 are the two TSC related changes, with feedback addressed,
  except for documenting the cpuid calls better (added TODO)

On Mon, Feb 23, 2026 at 2:28 PM Andres Freund <[email protected]> wrote:
> > diff --git a/meson.build b/meson.build
> > index ebfb85e93e5..312c919eaa4 100644
> > --- a/meson.build
> > +++ b/meson.build
> ...
> > int main(int arg, char **argv)
> > {
> >     unsigned int exx[4] = {0, 0, 0, 0};
>
> FWIW, this seems to trigger a warning locally:

Good catch, adjusted for both Meson and make - it looks like that was wrong
before already, since even MSVC defines this as a signed integer [1].

> > From 2392d95626599a1b5562f9216eb8c334db99c932 Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <[email protected]>
> > Date: Fri, 25 Jul 2025 17:57:20 -0700
> > Subject: [PATCH v8 2/4] Timing: Streamline ticks to nanosecond conversion
> >  across platforms
> >
> > ...
> > To support this, pg_initialize_timing() is introduced, and is now
> > mandatory for both the backend and any frontend programs to call before
> > utilizing INSTR_* macros.
>
> I wonder if it's worth trying to transparently initialize in the overflow
> codepath. Probably not, but worth explicitly considering.
If I follow, you're thinking of something like:

- Initialize max_ticks_no_overflow to 0 by default
- In the overflow path (which we'd reach the first time around), do an extra
  check whether max_ticks_no_overflow == 0, and if so call the initialization
  function
- The initialization function sets max_ticks_no_overflow to a non-zero value,
  so we don't get there the second time around

Is that right? (I think it could work, since it's "just" an extra jump
instruction in an unlikely edge case)

> > diff --git a/src/bin/pg_test_timing/pg_test_timing.c
> > b/src/bin/pg_test_timing/pg_test_timing.c
> > index a5621251afc..9fd630a490a 100644
> > @@ -182,9 +184,8 @@ test_timing(unsigned int duration)
> >                 bits;
> >
> >         prev = cur;
> > -       INSTR_TIME_SET_CURRENT(temp);
> > -       cur = INSTR_TIME_GET_NANOSEC(temp);
> > -       diff = cur - prev;
> > +       INSTR_TIME_SET_CURRENT(cur);
> > +       diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
> >
> >         /* Did time go backwards? */
> >         if (unlikely(diff < 0))
>
> FWIW, I don't think this needs a special INSTR_TIME macro, it could just use
> INSTR_TIME_SUBTRACT() and INSTR_TIME_GET_NANOSEC().

I think that makes sense, but FWIW, it's a bit inconvenient since
INSTR_TIME_SUBTRACT modifies the first argument (it's "x -= y", not "x - y"),
and we re-use "cur" later. But we can just make a copy of "cur" first and pass
that to the macro.

Additionally, I've now adjusted this to calculate the target end time when
starting up test_timing, and simply compare the current time against that
(with INSTR_TIME_GT), avoiding an INSTR_TIME_SUBTRACT and
INSTR_TIME_GET_NANOSEC in the hot loop. That does add another macro
(INSTR_TIME_SET_NANOSEC, to be able to initialize an instr_time from a
user-defined interval value), but I think that's worth it.

> > +
> > +static void
> > +set_ticks_per_ns()
> > +{
> > +       ticks_per_ns_scaled = INT64CONST(1000000000) * TICKS_TO_NS_PRECISION / GetTimerFrequency();
>
> This should probably use NS_PER_S.

Done.
> I wonder whether we should use an explicit shift here and in pg_ticks_to_ns(),
> to avoid having to rely on the compiler to do so for us.

I've left this as-is for now since I lacked the brain space to write out the
shift logic - but fine doing this either way.

> > +       /*
> > +        * Would multiplication overflow? If so perform computation in two parts.
> > +        * Check overflow without actually overflowing via: a * b > max <=> a > max / b
> > +        */
> > +       if (unlikely(ticks > (int64) max_ticks_no_overflow))
>
> The "via" comment seems a bit misplaced, given that the transformation is not
> really utilized here (but at the point where max_ticks_no_overflow is
> computed).

Hmm, yeah, I see your point. I wonder if we should move the "via ..." part to
the comment that's at the top of instr_time.c?

> > +       {
> > +               /*
> > +                * Compute how often the maximum number of ticks fits completely into
> > +                * the number of elapsed ticks and convert that number into
> > +                * nanoseconds. Then multiply by the count to arrive at the final
> > +                * value. In a 2nd step we adjust the number of elapsed ticks and
> > +                * convert the remaining ticks.
> > +                */
> > +               int64   count = ticks / max_ticks_no_overflow;
> > +               int64   max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
> > +
> > +               ns = max_ns * count;
> > +
> > +               /*
> > +                * Subtract the ticks that we now already accounted for, so that they
> > +                * don't get counted twice.
> > +                */
> > +               ticks -= count * max_ticks_no_overflow;
> > +               Assert(ticks >= 0);
>
> I think we could perhaps make the overflow case a good bit cheaper, by
> avoiding any divisions with a non-constant factor (assuming I haven't blown
> the logic below). Instead of doing a division we can "transform back" into
> the non-scaled representation, I think?
>   ns = (ticks * ticks_per_ns_scaled) / TICKS_TO_NS_PRECISION
>
> equals, assuming arbitrary precision
>
>   ns = (ticks / TICKS_TO_NS_PRECISION) * ticks_per_ns_scaled
>
> and not assuming arbitrary precision:
>
>   count = ticks // TICKS_TO_NS_PRECISION
>   rem_ticks = ticks - (count * TICKS_TO_NS_PRECISION)
>   ns = count * ticks_per_ns_scaled + rem_ticks * ticks_per_ns_scaled // TICKS_TO_NS_PRECISION
>
> None of which afaict would overflow?

I've left this as-is for now since I didn't write the original logic here (I
think it was you in a prior version?), and I need a good night's sleep to
think through this. Additional help reviewing your proposal is welcome.

> > +{
> > +#if defined(__linux__)
> > +       FILE       *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
> > +       char            buf[128];
> > +
> > +       if (!fp)
> > +               return false;
> > +
> > +       if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
> > +       {
> > +               fclose(fp);
> > +               return true;
> > +       }
> > +
> > +       fclose(fp);
> > +#endif
> > +
> > +       return false;
> > +}
>
> I think this will often disable tsc on VMs, due to linux defaulting to
> kvm-clock in KVM VMs.
>
> Do we care about that?
>
> If the tsc is not actually viable, is it still listed in
> /sys/devices/system/clocksource/clocksource0/available_clocksource
> ?

I think unless we want to do additional checks ourselves (something like in
[2]), we need to be careful here, and can't rely on the presence of "tsc" in
available clock sources to mean it's viable. Specifically, my understanding is
that the kernel lists "tsc" as available in more cases, and then, if it was
chosen initially, has watchdog logic that observes the TSC and switches away
from it if it's not viable. I think in such cases "tsc" would continue to be
listed as available, but the kernel would have notified the user in the kernel
log that TSC is unstable.
>
> > +
> > +#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
> > +#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
> > +
> > +static bool
> > +set_tsc_frequency_khz()
> > +{
> > +       uint32          r[4] = {0, 0, 0, 0};
> > +
> > +#if defined(HAVE__GET_CPUID)
> > +       __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
> > +#elif defined(HAVE__CPUID)
> > +       __cpuid(r, 0x15);
> > +#else
> > +#error cpuid instruction not available
> > +#endif
> > +
> > +       if (r[2] > 0)
> > +       {
> > +               if (r[0] == 0 || r[1] == 0)
> > +                       return false;
> > +
> > +               tsc_frequency_khz = r[2] / 1000 * r[1] / r[0];
> > +               return true;
> > +       }
>
> I think there should be some explanation about what this is testing.
> Including perhaps a reference to the relevant documents.
>
> > +       /* Some CPUs only report frequency in 16H */
>
> Dito.

Agreed - I've left this for a future revision to spell out, but added a TODO
for now so we don't forget.

> > @@ -93,13 +95,54 @@ typedef struct instr_time
> >  extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
> >  extern PGDLLIMPORT uint64 max_ticks_no_overflow;
> >
> > +#if defined(__x86_64__) || defined(_M_X64)
> > +#include <immintrin.h>
>
> Why do we need to include immintrin.h in instr_time.h? Including immintrin.h
> makes compilation a lot slower:

We previously had x86intrin.h there; I think David changed that to immintrin.h
in v8. I've adjusted this to use intrin.h on MSVC instead, as that's noted as
the correct file to include per [3], and back to x86intrin.h for other
platforms.
>
> >  int
> > @@ -46,10 +46,47 @@ main(int argc, char *argv[])
> >         /* initialize timing infrastructure (required for INSTR_* calls) */
> >         pg_initialize_timing();
> >
> > -       loop_count = test_timing(test_duration);
> > -
> > +       /*
> > +        * First, test default (non-fast) timing code. A clock source for that is
> > +        * always available. Hence, we can unconditionally output the result.
> > +        */
> > +       loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_SYSTEM, false);
> >         output(loop_count);
> >
> > +#if defined(__x86_64__) || defined(_M_X64)
>
> I don't love that now test_timing.c has architecture specific checks. Could
> we abstract this a bit more?

That's fair. I think it might be best if we introduce a new define that
controls whether we're on an architecture that supports the TSC logic. I've
added PG_INSTR_TSC_CLOCK for that purpose (naming feedback welcome), and
converted most "defined(__x86_64__) || defined(_M_X64)" checks to use that
instead, including the one in pg_test_timing.c.

I've also removed the direct "RDTSC" and "RDTSCP" clock source names in
pg_test_timing.c and added separate defines for those, for the theoretical
case that we add support for TSC-like mechanisms on another architecture.

> > +       /*
> > +        * If on a supported architecture, test the RDTSC clock source. This clock
> > +        * source is not always available. In that case the loop count will be 0
> > +        * and we don't print.
> > +        *
> > +        * We first emit RDTSCP timings, which is slower, and gets used for higher
> > +        * precision measurements when the TSC clock source is enabled. We emit
> > +        * RDTSC second, which is used for faster timing measurements with lower
> > +        * precision.
> > +        */
> > +       printf("\n");
> > +       loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, false);
> > +       if (loop_count > 0)
> > +       {
> > +               output(loop_count);
> > +               printf("\n");
> > +
> > +               /* Now, emit fast timing measurements */
> > +               loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
> > +               output(loop_count);
> > +               printf("\n");
> > +
> > +               pg_set_timing_clock_source(TIMING_CLOCK_SOURCE_AUTO);
> > +               if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> > +                       printf(_("TSC clock source will be used by default, unless timing_clock_source is set to 'system'.\n"));
> > +               else
> > +                       printf(_("TSC clock source will not be used by default, unless timing_clock_source is set to 'tsc'.\n"));
> > +       }
> > +       else
> > +               printf(_("TSC clock source is not usable. Likely unable to determine TSC frequency. are you running in an unsupported virtualized environment?.\n"));
> > +#endif
> > +
>
> A bit weird that most of the output stuff is handled in output(), but then
> some of it is handled directly in main() now, some of it in test_timing().

Hmm. I kind of see what you mean. I wonder if the main oddity there is having
lots of TSC logic in the main function, when other stuff lives later in the
file. To explore an alternate structure, I've added test_system_timing() and
test_tsc_timing() functions, so it's better abstracted. The new TSC printfs
are still directly in test_tsc_timing() (moved from main), because I don't see
an easy way to have these happen in output(). We could consider renaming
"output" to "output_timings" for further clarity?
Thanks,
Lukas

[0]: https://www.postgresql.org/message-id/flat/CAP53PkzGbyeJMLDAcvMRgzXPXYsYXZr3SBg0UwhfkYjqu8oK_g%40mail.gmail.com#7e007d1c3769c2755f0d98fa7f8b048a
[1]: https://learn.microsoft.com/en-us/cpp/intrinsics/cpuid-cpuidex?view=msvc-170
[2]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c#L1268
[3]: https://learn.microsoft.com/en-us/cpp/intrinsics/rdtsc?view=msvc-170

--
Lukas Fittl
v9-0001-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUNT.patch
Description: Binary data
v9-0002-instrumentation-Drop-INSTR_TIME_SET_CURRENT_LAZY-.patch
Description: Binary data
v9-0003-instrumentation-Rename-INSTR_TIME_LT-macro-to-INS.patch
Description: Binary data
v9-0005-instrumentation-Streamline-ticks-to-nanosecond-co.patch
Description: Binary data
v9-0004-pg_test_timing-Reduce-per-loop-overhead.patch
Description: Binary data
v9-0006-instrumentation-Use-Time-Stamp-Counter-TSC-on-x86.patch
Description: Binary data
v9-0007-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and-.patch
Description: Binary data
