+ /*
+ * Once freq_khz / prev_freq_khz is small, check if it stays that way.
+ * If it does for long enough, we've got a winner frequency.
+ */
+ if (prev_freq_khz != 0 && fabs(freq_khz / prev_freq_khz) < 1.0001)
Doesn't this accept any frequency decrease as stable? The ratio is
always positive, so fabs() is effectively a no-op there, and any drop
in the estimate, however large, still passes the check. Shouldn't it
instead use something like
fabs(1 - freq_khz / prev_freq_khz) < 0.0001
which bounds the relative change in both directions?
- total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
+ INSTR_TIME_SET_NANOSEC(duration_time, duration > 0 ? duration * NS_PER_S : 0);
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_NANOSEC(start_time);
+ cur = start_time;
- while (time_elapsed < total_time)
+ end_time = start_time;
+ INSTR_TIME_ADD(end_time, duration_time);
Is this correct? Won't INSTR_TIME_ADD add nanoseconds (duration_time)
to TSC ticks (end_time)? INSTR_TIME_GET_NANOSEC uses pg_ticks_to_ns,
but INSTR_TIME_SET_NANOSEC simply stores the raw value without
applying the inverse ns-to-ticks conversion.
Also, if the clock source is switched during a statement we get totally
incorrect results. I don't think this has any real consequences in the
current use case, but maybe it's worth at least mentioning somewhere?
For example:
CREATE FUNCTION switch_timing_source_and_sleep()
RETURNS void AS $$
BEGIN
SET timing_clock_source = 'system';
PERFORM pg_sleep(0.01);
END;
$$ LANGUAGE plpgsql;
SET timing_clock_source = 'tsc';
EXPLAIN ANALYZE SELECT switch_timing_source_and_sleep();
On Wed, Mar 11, 2026 at 7:46 PM Lukas Fittl <[email protected]> wrote:
>
> Hi,
>
> Attached v11, with the following changes:
>
> 0001 is a new patch that implements the refactorings of the CPUID code
> suggested by John.
> 0002 is the existing patch for supporting using __cpuidex directly
> (needed by the TSC hypervisor frequency code).
> 0003 is the existing patch as before to optimize pg_test_timing.
>
> 0004 is almost identical to the previous patch (v10/0003) that adds
> the ticks to NS conversion, with a small improvement to use an
> explicit define (PG_INSTR_TICKS_TO_NS) that controls whether we go
> into the complex pg_ticks_to_ns logic at all.
>
> 0005 is the TSC patch with the following changes:
>
> - Dropped the HyperV MSR read again, per Andres feedback
> - Added a TSC calibration loop that is used if we can't get the
> frequency from CPUID. This is based on a script that Andres shared
> off-list, and works both on HyperV as well as my bare metal AMD CPU.
> Note that we don't utilize this on Windows to avoid the delay for the
> calibration to converge (< 50ms, typically less than 1ms) penalizing
> connection start (since we don't get the tsc frequency global from
> postmaster before the fork)
> - Moved the GUC logic to instrument.c, because we shouldn't be
> defining GUCs in a file that's built with front-end programs
>
> It's worth noting that I have not yet included a way to pass debug
> information back to the user (e.g. when the TSC calibration didn't
> converge, the TSC is not invariant, etc), as Jakub suggested
> previously - with the TSC calibration code in the picture I'm less
> sure it's really needed, since e.g. looking at the "cpuid" program will
> tell you whether calibration runs or not, and then you could infer
> that the calibration failed if you didn't get a usable TSC reported by
> pg_test_timing.
>
> 0006 is the existing patch as before to add pg_test_timing debug output.
>
> 0007 is a new patch that shows how we could expand this to also be
> used for ARM, by calling CNTVCT_EL0. I'm mainly adding this because I
> think it's the main evolution of this that we haven't talked about that
> much yet, and even if we do this in a later release cycle it'll help
> refine the design. This worked as expected for me on an AWS Graviton
> instance, but failed on an Apple Silicon M3 due to quirks with its
> Efficiency vs Performance cores - dealt with in the patch by not using
> the generic timer directly when we're on a non-homogeneous core system.
>
> Looking at how an ARM implementation could work does make me wonder
> one thing in general: Maybe we shouldn't be using the term "tsc" for
> "timing_clock_source" (and internal defines), since "TSC" is an x86
> term, that doesn't really make sense when we expand this to ARM in a
> future release. Maybe we should use a generic name, like "hwtimer",
> "direct" or "hardware"?
>
> On Sun, Mar 8, 2026 at 9:39 AM Andres Freund <[email protected]> wrote:
> > > Alternatively, we could consider doing it like the Kernel does it for
> > > its calibration loop, and wait 1 second of wall time, and then see how
> > > far the TSC counter has advanced.
> >
> > Yea, I think we need a calibration loop, unfortunately. But I think it
> > should be doable to make it a lot quicker than waiting one second. I'm thinking of
> > something like a loop that measures the clock cycles and relative time
> > (using clock_gettime()) since the start and does so until the frequency
> > estimate predicts the time results closely. I think it should be a few 10s of
> > milliseconds at most.
>
> That is implemented now in 0005, based on the script you shared off
> list, which I hopefully translated correctly to the Postgres source.
> The one part I wasn't sure about is whether we want to use RDTSCP for
> the calibration, or RDTSC+LFENCE like you had it in your script (which
> is closer to what the abseil library does, that I mentioned upthread)
> - for now I went with RDTSCP to keep it simple.
>
> We run up to 1 million RDTSCP instructions, for at most 50ms, and
> terminate once the frequency stays stable for at least 3 iterations.
> In testing this converges pretty quickly in practice (<1ms) and
> closely matches the reported TSC frequency by the Linux kernel.
>
> Thanks,
> Lukas
>
> --
> Lukas Fittl