Hi,

On 2026-02-25 02:00:35 -0800, Lukas Fittl wrote:
> Hi Andres, thanks for the detailed review!
>
> And thanks to David for testing on Windows (and patching what's
> needed), great to have the confirmation this works there too.
>
> See attached v9, rebased, with feedback addressed, unless otherwise noted.
>
> 0001 is the updated cpuidex patch, and should be good to go (though we
> could also merge this into the TSC commit, if you prefer)
John, does this interact with your work at all?

> 0002 is the INSTR_TIME_SET_CURRENT_LAZY removal pulled forward
> 0003 is the INSTR_TIME_LT => INSTR_TIME_GT naming fix from [0] because
> I assume we'll push that shortly, and 0004 now uses that macro

Pushed these two.

> > > From 2392d95626599a1b5562f9216eb8c334db99c932 Mon Sep 17 00:00:00 2001
> > > From: Lukas Fittl <[email protected]>
> > > Date: Fri, 25 Jul 2025 17:57:20 -0700
> > > Subject: [PATCH v8 2/4] Timing: Streamline ticks to nanosecond conversion
> > >  across platforms
> > >
> > > ...
> > > To support this, pg_initialize_timing() is introduced, and is now
> > > mandatory for both the backend and any frontend programs to call before
> > > utilizing INSTR_* macros.
> >
> > I wonder if it's worth trying to transparently initialize in the overflow
> > codepath. Probably not, but worth explicitly considering.
>
> If I follow, you're thinking of something like:
>
> - Initialize max_ticks_no_overflow to 0 by default
> - In the overflow path (which we'd reach the first time around), do an
>   extra check if max_ticks_no_overflow == 0, and then call the
>   initialization function
> - The initialization function sets max_ticks_no_overflow to a non-zero
>   value, so we don't get there the second time around
>
> Is that right? (I think it could work, since it's "just" an extra jump
> instruction in an unlikely edge case)

Yes, that's what I was wondering about.

> > > diff --git a/src/bin/pg_test_timing/pg_test_timing.c
> > > b/src/bin/pg_test_timing/pg_test_timing.c
> > > index a5621251afc..9fd630a490a 100644
> > >
> > > @@ -182,9 +184,8 @@ test_timing(unsigned int duration)
> > >  			bits;
> > >
> > >  		prev = cur;
> > > -		INSTR_TIME_SET_CURRENT(temp);
> > > -		cur = INSTR_TIME_GET_NANOSEC(temp);
> > > -		diff = cur - prev;
> > > +		INSTR_TIME_SET_CURRENT(cur);
> > > +		diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
> > >
> > >  		/* Did time go backwards? */
> > >  		if (unlikely(diff < 0))
> >
> > FWIW, I don't think this needs a special INSTR_TIME macro, it could just use
> > INSTR_TIME_SUBTRACT() and INSTR_TIME_GET_NANOSEC().
>
> I think that makes sense, but FWIW, it's a bit inconvenient since
> INSTR_TIME_SUBTRACT modifies the first argument (it's "x -= y", not "x
> - y"), and we re-use "cur" later. But we can just make a copy of "cur"
> first that we pass to the macro.

Exactly.

> Additionally, I've now adjusted this to calculate the target end time
> when starting up test_timing, and simply compare the current time
> against that (with INSTR_TIME_GT), avoiding an INSTR_TIME_SUBTRACT and
> INSTR_TIME_GET_NANOSEC in the hot loop. That does add another macro
> though (INSTR_TIME_SET_NANOSEC, to be able to initialize instr_time
> from a user-defined interval value), but I think that's worth it.

Agreed.

> > > +	/*
> > > +	 * Would multiplication overflow? If so perform computation in two parts.
> > > +	 * Check overflow without actually overflowing via: a * b > max <=>
> > > +	 * a > max / b
> > > +	 */
> > > +	if (unlikely(ticks > (int64) max_ticks_no_overflow))
> >
> > The "via" comment seems a bit misplaced, given that the transformation is not
> > really utilized here (but at the point where max_ticks_no_overflow is
> > computed).
>
> Hmm, yeah, I see your point. I wonder if we should move the "via ..."
> part to the comment that's at the top of instr_time.c?

Yea, probably some central place documenting this makes sense.

> > > +	{
> > > +		/*
> > > +		 * Compute how often the maximum number of ticks fits completely into
> > > +		 * the number of elapsed ticks and convert that number into
> > > +		 * nanoseconds. Then multiply by the count to arrive at the final
> > > +		 * value. In a 2nd step we adjust the number of elapsed ticks and
> > > +		 * convert the remaining ticks.
> > > +		 */
> > > +		int64		count = ticks / max_ticks_no_overflow;
> > > +		int64		max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
> > > +
> > > +		ns = max_ns * count;
> > > +
> > > +		/*
> > > +		 * Subtract the ticks that we now already accounted for, so that they
> > > +		 * don't get counted twice.
> > > +		 */
> > > +		ticks -= count * max_ticks_no_overflow;
> > > +		Assert(ticks >= 0);
> >
> > I think we could perhaps make the overflow case a good bit cheaper, by
> > avoiding any divisions with a non-constant factor (assuming I haven't blown
> > the logic below). Instead of doing a division we can "transform back" into
> > the non-scaled representation, I think?
> >
> >   ns = (ticks * ticks_per_ns_scaled) / TICKS_TO_NS_PRECISION
> >
> > equals, assuming arbitrary precision
> >
> >   ns = (ticks / TICKS_TO_NS_PRECISION) * ticks_per_ns_scaled
> >
> > and not assuming arbitrary precision:
> >
> >   count = ticks // TICKS_TO_NS_PRECISION
> >   rem_ticks = ticks - (count * TICKS_TO_NS_PRECISION)
> >   ns = count * ticks_per_ns_scaled + rem_ticks * ticks_per_ns_scaled // TICKS_TO_NS_PRECISION
> >
> > None of which afaict would overflow?
>
> I've left this as is for now since I didn't write the original logic
> here (I think it was you in a prior version?), and I need a good
> night's sleep to think through this. Additional help welcome to review
> your proposal.

I don't remember writing the logic, but that doesn't say much :)

> > I think this will often disable tsc on VMs, due to linux defaulting to
> > kvm-clock in KVM VMs.
> >
> > Do we care about that?
> >
> > If the tsc is not actually viable, is it still listed in
> > /sys/devices/system/clocksource/clocksource0/available_clocksource
> > ?
>
> I think unless we want to do additional checks ourselves (something
> like in [2]), we need to be careful here, and can't rely on the
> presence of "tsc" in available clock sources to mean it's viable.
> Specifically, my understanding is that the kernel lists "tsc" as
> available in more cases, and then, if it is chosen in the beginning,
> has a watchdog logic that observes the TSC and switches away from it
> if it's not viable. I think in such cases "tsc" would continue to be
> listed as available, but the kernel would have notified the user in
> the kernel log that the TSC is unstable.

We probably should verify that the kernel indeed behaves that way,
otherwise far fewer people will benefit from this improvement.

> > > @@ -93,13 +95,54 @@ typedef struct instr_time
> > >  extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
> > >  extern PGDLLIMPORT uint64 max_ticks_no_overflow;
> > >
> > > +#if defined(__x86_64__) || defined(_M_X64)
> > > +#include <immintrin.h>
> >
> > Why do we need to include immintrin.h in instr_time.h? Including
> > immintrin.h makes compilation a lot slower:
>
> We previously had x86intrin.h there, I think David changed that to
> immintrin.h in v8. I've adjusted this to use intrin.h on MSVC instead,
> as that's noted as the correct file to include in [3], and back to
> x86intrin.h for other platforms.

My concern is that instr_time.h is quite widely included, including
through executor/instrument.h and pgstat.h.

An -O0 build without including immintrin.h:

$ ninja clean && rm -f .ninja_* && CCACHE_DISABLE=1 time ninja
297.05user 53.54system 0:21.53elapsed 1628%CPU (0avgtext+0avgdata 493044maxresident)k
584inputs+4215576outputs (36major+13819907minor)pagefaults 0swaps

Just adding #include <immintrin.h> to instr_time.h:

$ ninja clean && rm -f .ninja_* && CCACHE_DISABLE=1 time ninja
529.83user 81.39system 0:47.85elapsed 1277%CPU (0avgtext+0avgdata 585492maxresident)k
3504inputs+5232544outputs (31major+23905659minor)pagefaults 0swaps

I.e. the elapsed build time more than doubled. That seems problematic
to me. I think we could either:

a) Avoid the expensive include, e.g. by including a narrower header or
   by just using the underlying builtin directly.
b) Introduce a separate header just defining the instr_time type, which
   then is included in headers like executor/instrument.h, pgstat.h,
   where we just need the type.

> > > +	/*
> > > +	 * If on a supported architecture, test the RDTSC clock source. This clock
> > > +	 * source is not always available. In that case the loop count will be 0
> > > +	 * and we don't print.
> > > +	 *
> > > +	 * We first emit RDTSCP timings, which is slower, and gets used for higher
> > > +	 * precision measurements when the TSC clock source is enabled. We emit
> > > +	 * RDTSC second, which is used for faster timing measurements with lower
> > > +	 * precision.
> > > +	 */
> > > +	printf("\n");
> > > +	loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, false);
> > > +	if (loop_count > 0)
> > > +	{
> > > +		output(loop_count);
> > > +		printf("\n");
> > > +
> > > +		/* Now, emit fast timing measurements */
> > > +		loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, true);
> > > +		output(loop_count);
> > > +		printf("\n");
> > > +
> > > +		pg_set_timing_clock_source(TIMING_CLOCK_SOURCE_AUTO);
> > > +		if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> > > +			printf(_("TSC clock source will be used by default, unless timing_clock_source is set to 'system'.\n"));
> > > +		else
> > > +			printf(_("TSC clock source will not be used by default, unless timing_clock_source is set to 'tsc'.\n"));
> > > +	}
> > > +	else
> > > +		printf(_("TSC clock source is not usable. Likely unable to determine TSC frequency. Are you running in an unsupported virtualized environment?\n"));
> > > +#endif
> > > +
> >
> > A bit weird that most of the output stuff is handled in output(), but then
> > some of it is handled directly in main() now, some of it in test_timing().
>
> Hmm. I kind of see what you mean. I wonder if the main oddity there is
> having lots of TSC logic in the main function, when other stuff lives
> later in the file.
>
> To explore an alternate structure, I've added test_system_timing() and
> test_tsc_timing() functions, so it's better abstracted. The new TSC
> printfs are still directly in test_tsc_timing (moved from main),
> because I don't see an easy way to have these happen in output().

Seems like an improvement.


Haven't yet had the bandwidth to review most of the new version. Mostly
wanted to get out an explanation for the immintrin.h concern.

Greetings,

Andres Freund
