On Thu, Feb 12, 2026 at 4:41 PM Andres Freund <[email protected]> wrote:
> On 2026-02-12 08:05:27 -0800, Lukas Fittl wrote:
> > On master (88327092ff0), I'm getting 23.54 ns from pg_test_timing - vs
> > with 0002 applied, this slows to 25.74 ns. I've tried to see if the
> > "unlikely(..)" we added in pg_ticks_to_ns is the problem (since in the
> > clock_gettime() case we'd always be running into that branch due to
> > the size of the nanoseconds value), but no luck - I think the extra
> > multiplication/division itself is the problem.
> >
> > Any ideas how we could do this differently?
>
> The problem looks to be that you're going to take the slowpath when using
> clock_gettime(), unless you booted within the last three days.
>
> This can largely be addressed by keeping prev and cur in the instr_time
> domain and only converting the difference to nanoseconds.

Right - we actually already had that change in the patch series, in
the pg_test_timing patch -- it might even have been authored by you in
a prior iteration? -- but it did not appear to fully resolve the
regression for me. I'll move it into the 0002 patch to ease testing.

> I wonder if pg_test_timing should have a small loop with a fixed count to
> determine the timing without all the overhead the existing loop has...

I agree that using a fixed count in pg_test_timing would be helpful
to measure just the clock readings themselves, as opposed to the
translation into nanoseconds.
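
Sketched out, a fixed-count measurement could look roughly like the
following (entirely hypothetical -- the function name and NUM_SAMPLES
are made up, and it only sums the raw ticks so that the ticks-to-ns
conversion stays outside of the measured loop):

    #include "postgres_fe.h"
    #include "portability/instr_time.h"

    #define NUM_SAMPLES 1000000

    static void
    measure_clock_read_overhead(void)
    {
        instr_time  start,
                    end;
        volatile uint64 sink = 0;

        INSTR_TIME_SET_CURRENT(start);
        for (int i = 0; i < NUM_SAMPLES; i++)
        {
            instr_time  t;

            INSTR_TIME_SET_CURRENT(t);
            /* summing raw ticks keeps the reads from being optimized away */
            sink += (uint64) t.ticks;
        }
        INSTR_TIME_SET_CURRENT(end);

        INSTR_TIME_SUBTRACT(end, start);
        printf("per-read overhead: %.2f ns\n",
               (double) INSTR_TIME_GET_NANOSEC(end) / NUM_SAMPLES);
    }

That reports a single average and skips the per-loop histogram work,
so it should isolate the clock read cost better than the current loop.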

As an alternative to fixing this in pg_test_timing alone (on the off
chance another caller does a lot of ticks-to-time conversions), I
spent some time today looking at the assembly for pg_ticks_to_ns (see
[0]), and specifically tried:

(1) changing the pg_ticks_to_ns logic to do an explicit
"ticks_per_ns_scaled == 0" check and early return at the start, and
setting ticks_per_ns_scaled to 0 when clock_gettime() is used. This is
similar to what David already suggested in an earlier email.
(2) using uint64 for the ticks_per_ns_scaled/max_ticks_no_overflow
variables - this appears to help GCC reliably generate a bit shift
instead of an idiv instruction. (Rough sketch of both changes below.)
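
To make (1) and (2) concrete, the shape of the function is roughly the
following -- a simplified stand-in, not the patch itself; the
fixed-point shift is made up, and int64/uint64/likely()/UINT64CONST()
come from c.h:

    #define TICKS_SHIFT 32      /* made-up fixed-point scale */

    /* 0 means clock_gettime() is in use and ticks already are nanoseconds */
    static uint64 ticks_per_ns_scaled = 0;
    static uint64 max_ticks_no_overflow = 0;

    static inline int64
    pg_ticks_to_ns(int64 ticks)
    {
        /* (1) early return for the clock_gettime() case */
        if (ticks_per_ns_scaled == 0)
            return ticks;

        /*
         * (2) with uint64 operands and a power-of-two scale this compiles
         * to a multiply plus a shift, no idiv, on the common path.
         */
        if (likely((uint64) ticks <= max_ticks_no_overflow))
            return (int64) (((uint64) ticks * ticks_per_ns_scaled) >> TICKS_SHIFT);

        /* slow path: split the value so the multiplication cannot overflow */
        {
            uint64      t = (uint64) ticks;
            uint64      hi = t >> TICKS_SHIFT;
            uint64      lo = t & ((UINT64CONST(1) << TICKS_SHIFT) - 1);

            return (int64) (hi * ticks_per_ns_scaled +
                            ((lo * ticks_per_ns_scaled) >> TICKS_SHIFT));
        }
    }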

Those two changes appear to eliminate the regression in my testing.
Attached is an updated v7, which also has some slightly improved
commit messages.

Additional comparisons with the test case you had back at the start of
this thread, with the system clock source on my test VM:

master:

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1888.891 ms (best of 3)
pg_test_timing / Average loop time including overhead: 23.53 ns

v6 (0002 + pg_test_timing prev/cur change):

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1897.095 ms (best of 3)
pg_test_timing / Average loop time including overhead: 25.52 ns

v7 (0002):

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1897.148 ms (best of 3)
pg_test_timing / Average loop time including overhead: 23.14 ns

And when looking at the TSC time source with the full patch set on the same VM:

v6:

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1477.672 ms (best of 3)
pg_test_timing / Average loop time including overhead: 11.79 ns

v7:

EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1476.326 ms (best of 3)
pg_test_timing / Average loop time including overhead: 11.78 ns

Thanks,
Lukas

[0]: https://godbolt.org/z/EvK1M66n5

--
Lukas Fittl

Attachment: v7-0003-Timing-Use-Time-Stamp-Counter-TSC-on-x86-64-for-f.patch
Attachment: v7-0001-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUNT.patch
Attachment: v7-0002-Timing-Streamline-ticks-to-nanosecond-conversion-.patch
Attachment: v7-0004-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and-.patch
