Hi, Lukas, could you add some preliminary Reviewed-by tags to the not-yet-committed commitse?
On 2026-03-25 18:17:44 -0700, Lukas Fittl wrote: > From 7cc7e230e05947892b2aa479eba45144beede48e Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <[email protected]> > Date: Sat, 31 Jan 2026 08:49:46 -0800 > Subject: [PATCH v14 1/6] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT > separately John, do you want to take this one, or should I take it? > Subject: [PATCH v14 2/6] pg_test_timing: Reduce per-loop overhead Pushed, after simplifying the following > + INSTR_TIME_ADD_NANOSEC(end_time, duration > 0 ? duration * NS_PER_S : > 0); by removing that ?:, since duration is unsigned and the caller checks for 0 and negative values. And even if a negative value or 0 were to be passed, it'd still be harmless. > From 8fbde9831fd89615a41af56917aed9faf619dbbb Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <[email protected]> > Date: Fri, 25 Jul 2025 17:57:20 -0700 > Subject: [PATCH v14 3/6] instrumentation: Streamline ticks to nanosecond > conversion across platforms > > The timing infrastructure (INSTR_* macros) measures time elapsed using > clock_gettime() on POSIX systems, which returns the time as nanoseconds, > and QueryPerformanceCounter() on Windows, which is a specialized timing > clock source that returns a tick counter that needs to be converted to > nanoseconds using the result of QueryPerformanceFrequency(). > > This conversion currently happens ad-hoc on Windows, e.g. when calling > INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every > invocation, despite the frequency being stable after program start, > incurring unnecessary overhead. It also causes a fractured implementation > where macros are defined differently between platforms. > > To ease code readability, and prepare for a future change that intends > to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce > a new pg_ticks_to_ns() function that gets called on all platforms. > > This function relies on a separately initialized ticks_per_ns_scaled > value, that represents the conversion ratio. This value is initialized > from QueryPerformanceFrequency() on Windows, and set to zero on x86-64 > POSIX systems, which results in the ticks being treated as nanoseconds. > Other architectures always directly return the original ticks. > To support this, pg_initialize_timing() is introduced, and is now > mandatory for both the backend and any frontend programs to call before > utilizing INSTR_* macros. Are there assertions detecting missing initializations? Ah, yes, there is. > +/* > + * Stores what the number of ticks needs to be multiplied with to end up > + * with nanoseconds using integer math. > + * > + * On certain platforms (currently Windows) the ticks to nanoseconds > conversion > + * requires floating point math because: > + * > + * sec = ticks / frequency_hz > + * ns = ticks / frequency_hz * 1,000,000,000 > + * ns = ticks * (1,000,000,000 / frequency_hz) > + * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz > + * > + * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU > + * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2. > + * > + * To be able to use integer math we work around the lack of precision. We > + * first scale the integer up (left shift by TICKS_TO_NS_SHIFT) and after the > + * multiplication by the number of ticks in pg_ticks_to_ns() we shift right > by > + * the same amount. We utilize unsigned integers even though ticks are stored > + * as a signed value to encourage compilers to generate better assembly. > + * > + * We remember the maximum number of ticks that can be multiplied by the > scale > + * factor without overflowing so we can check via a * b > max <=> a > max / > b. > + * > + * On all other platforms we are using clock_gettime(), which uses > nanoseconds > + * as ticks. Hence, we set the multiplier to zero, which causes > pg_ticks_to_ns > + * to return the original value. > + */ > +uint64 ticks_per_ns_scaled = 0; > +uint64 max_ticks_no_overflow = 0; > +bool timing_initialized = false; Think it may be worth adding an example for a realistic max_ticks_no_overflow to show that we almost always are going to be able to avoid the overflow case. > +static void set_ticks_per_ns(void); > + > +void > +pg_initialize_timing(void) > +{ > + if (timing_initialized) > + return; > + > + set_ticks_per_ns(); > + timing_initialized = true; > +} > + > +#ifndef WIN32 > + > +static void > +set_ticks_per_ns(void) > +{ > + ticks_per_ns_scaled = 0; > + max_ticks_no_overflow = 0; > +} > + > +#else /* WIN32 */ > + > +/* GetTimerFrequency returns counts per second */ > +static inline double > +GetTimerFrequency(void) > +{ > + LARGE_INTEGER f; > + > + QueryPerformanceFrequency(&f); > + return (double) f.QuadPart; > +} > + > +static void > +set_ticks_per_ns(void) > +{ > + ticks_per_ns_scaled = (NS_PER_S << TICKS_TO_NS_SHIFT) / > GetTimerFrequency(); > + max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled; > +} > + > +#endif /* WIN32 */ Seems a tiny bit odd to introduce them here with a name to then evolve the naming pattern in the subsequent patches. > +static inline int64 > +pg_ticks_to_ns(int64 ticks) > { > - LARGE_INTEGER f; > +#if PG_INSTR_TICKS_TO_NS > + int64 ns = 0; > + > + Assert(timing_initialized); > + > + /* > + * Avoid doing work if we don't use scaled ticks, e.g. system clock on > + * Unix > + */ > + if (ticks_per_ns_scaled == 0) > + return ticks; > + > + /* > + * Would multiplication overflow? If so perform computation in two > parts. > + */ > + if (unlikely(ticks > (int64) max_ticks_no_overflow)) > + { > + /* > + * To avoid overflow, first scale total ticks down by the fixed > + * factor, and *afterwards* multiply them by the > frequency-based scale > + * factor. > + * > + * The remaining ticks can follow the regular formula, since > they > + * won't overflow. > + */ > + int64 count = ticks >> TICKS_TO_NS_SHIFT; > + > + ns = count * ticks_per_ns_scaled; > + ticks -= (count << TICKS_TO_NS_SHIFT); > + } > + > + ns += (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT; > + > + return ns; > +#else > + Assert(timing_initialized); > > - QueryPerformanceFrequency(&f); > - return (double) f.QuadPart; > + return ticks; > +#endif /* > PG_INSTR_TICKS_TO_NS */ Kinda wondering what the best way would be to verify that the overflow case actually works. What about adding a regress.c test that checks that we get the right results for something like INSTR_TIME_SET_ZERO(t); INSTR_TIME_ADD_NANOSEC(t, some_number); EXPECT_EQ_U64(INSTR_TIME_GET_NANOSEC(t), some_number); for a few different some_numbers? > Subject: [PATCH v14 4/6] instrumentation: Use Time-Stamp Counter (TSC) on > x86-64 for faster measurements > +bool > +check_timing_clock_source(int *newval, void **extra, GucSource source) > +{ > + /* > + * Ensure timing is initialized. On Windows (EXEC_BACKEND), GUC hooks > can > + * be called during InitializeGUCOptions() before InitProcessGlobals() > has > + * had a chance to run pg_initialize_timing(). > + */ > + pg_initialize_timing(); So doesn't that mean the effort to synchronize the tsc freq on EXEC_BACKEND is futile? > diff --git a/src/bin/pg_test_timing/pg_test_timing.c > b/src/bin/pg_test_timing/pg_test_timing.c > index 1d9ee4fb588..ee0e3a3b0ab 100644 > --- a/src/bin/pg_test_timing/pg_test_timing.c > +++ b/src/bin/pg_test_timing/pg_test_timing.c > @@ -43,7 +43,9 @@ main(int argc, char *argv[]) > > handle_args(argc, argv); > > - /* initialize timing infrastructure (required for INSTR_* calls) */ > + /* > + * Initialize timing infrastructure (required for INSTR_* calls) > + */ > pg_initialize_timing(); > > loop_count = test_timing(test_duration); This seems like a superfluous change? > +bool > +pg_set_timing_clock_source(TimingClockSourceType source) > +{ > + Assert(timing_initialized); > + > +#if PG_INSTR_TSC_CLOCK > + pg_initialize_timing_tsc(); > +#endif > + > +#if PG_INSTR_TSC_CLOCK > + switch (source) > + { > + case TIMING_CLOCK_SOURCE_AUTO: > + use_tsc = (tsc_frequency_khz > 0) && > tsc_use_by_default(); A global named use_tsc seems a tad too short a name for my personal taste. Too likely to also be used by something else. > +int32 tsc_frequency_khz = -1; > + > +static uint32 tsc_calibrate(void); > + > +/* > + * Detect the TSC frequency and whether RDTSCP is available on x86-64. > + * > + * This can't be reliably determined at compile time, since the > + * availability of an "invariant" TSC (that is not affected by CPU > + * frequency changes) is dependent on the CPU architecture. Additionally, > + * there are cases where TSC availability is impacted by virtualization, > + * where a simple cpuid feature check would not be enough. > + */ > +static void > +tsc_detect_frequency(void) > +{ > + tsc_frequency_khz = 0; So we're using -1 for uninitialized and 0 for unknown? If so we should document that. > + /* We require RDTSCP support, bail if not available */ > + if (!x86_feature_available(PG_RDTSCP)) > + return; > + > + /* Determine speed at which the TSC advances */ > + tsc_frequency_khz = x86_tsc_frequency_khz(); > + if (tsc_frequency_khz > 0) > + return; > + > + /* > + * CPUID did not give us the TSC frequency. If TSC is invariant and > RDTSCP > + * is available, we can measure the frequency by comparing TSC ticks > + * against walltime using a short calibration loop. > + */ > + if (x86_feature_available(PG_TSC_INVARIANT)) > + tsc_frequency_khz = tsc_calibrate(); > +} Why do we not need to always check if the TSC is invariant? Seems like a hard requirement to me? > +/* > + * Calibrate the TSC frequency by comparing TSC ticks against walltime. > + * > + * Takes initial TSC and system clock snapshots, then loops, recomputing the > + * frequency each TSC_CALIBRATION_SKIPS iterations from cumulative TSC > + * ticks divided by elapsed time. > + * > + * Once the frequency estimate stabilizes (consecutive iterations agree), we > + * consider it converged and the frequency in KHz is returned. If either too > + * many iterations or a time limit passes without convergence, 0 is returned. > + */ > +#define TSC_CALIBRATION_MAX_NS (50 * NS_PER_MS) > +#define TSC_CALIBRATION_ITERATIONS 1000000 > +#define TSC_CALIBRATION_SKIPS 100 > +#define TSC_CALIBRATION_STABLE_CYCLES 10 > + > +static uint32 > +tsc_calibrate(void) > +{ > + instr_time initial_wall; > + int64 initial_tsc; > + double freq_khz = 0; > + double prev_freq_khz = 0; > + int stable_count = 0; > + int64 prev_tsc; > + uint32 unused; > + > + /* Ensure INSTR_* time below work on system time */ > + set_ticks_per_ns_system(); > + > + INSTR_TIME_SET_CURRENT(initial_wall); > + > +#ifdef _MSC_VER > + initial_tsc = __rdtscp(&unused); > +#else > + initial_tsc = __builtin_ia32_rdtscp(&unused); > +#endif > + prev_tsc = initial_tsc; > + > + for (int i = 0; i < TSC_CALIBRATION_ITERATIONS; i++) > + { > + instr_time now_wall; > + int64 now_tsc; > + int64 elapsed_ns; > + int64 elapsed_ticks; > + > + INSTR_TIME_SET_CURRENT(now_wall); > + > +#ifdef _MSC_VER > + now_tsc = __rdtscp(&unused); > +#else > + now_tsc = __builtin_ia32_rdtscp(&unused); > +#endif I'd put this in a helper, given that we already need it twice, and we might later need to extend it for other architectures. > + /* > + * Skip if this is not the Nth cycle where we measure, if TSC > hasn't > + * advanced, or we walked backwards for some reason. > + */ > + if (i % TSC_CALIBRATION_SKIPS != 0 || now_tsc == prev_tsc || > elapsed_ns <= 0 || elapsed_ticks <= 0) > + continue; What's the goal of TSC_CALIBRATION_SKIPS? > + freq_khz = ((double) elapsed_ticks / elapsed_ns) * 1000 * 1000; > + > + /* > + * Once freq_khz / prev_freq_khz is small, check if it stays > that way. > + * If it does for long enough, we've got a winner frequency. > + */ > + if (prev_freq_khz != 0 && fabs(1 - freq_khz / prev_freq_khz) < > 0.0001) > + { > + stable_count++; > + if (stable_count >= TSC_CALIBRATION_STABLE_CYCLES) > + return (uint32) freq_khz; > + } > + else > + stable_count = 0; > + > + prev_tsc = now_tsc; > + prev_freq_khz = freq_khz; > + } > + > + /* did not converge */ > + return 0; > +} I think it may be worth teaching pg_test_timing to display the tsc frequency and show the frequency determined by calibration even if we got it via cpuid. Otherwise it'll be hard to debug this. > +/* > + * Initialize the TSC clock source by determining its usability and > frequency. > + * > + * This can be called multiple times, as tsc_frequency_khz will be set to 0 > + * if a prior call determined the TSC is not usable. On EXEC_BACKEND > (Windows), > + * the TSC frequency may also be set by restore_backend_variables. > + */ > +void > +pg_initialize_timing_tsc(void) > +{ > + if (tsc_frequency_khz < 0) > + tsc_detect_frequency(); > +} Maybe we should add a debug log with the frequency in the < 0 case? > +uint32 > +x86_tsc_frequency_khz(void) > +{ > + unsigned int r[4] = {0}; > + > + if (!x86_feature_available(PG_TSC_INVARIANT)) > + return 0; > + > + if (x86_feature_available(PG_HYPERVISOR)) > + return x86_hypervisor_tsc_frequency_khz(); > + > + /* > + * On modern Intel CPUs, the TSC is implemented by invariant timekeeping > + * hardware, also called "Always Running Timer", or ART. The ART stays > + * consistent even if the CPU changes frequency due to changing power > + * levels. > + * > + * As documented in "Determining the Processor Base Frequency" in the > + * "IntelĀ® 64 and IA-32 Architectures Software Developer's Manual", > + * February 2026 Edition, we can get the TSC frequency as follows: > + * > + * Nominal TSC frequency = ( CPUID.15H:ECX[31:0] * CPUID.15H:EBX[31:0] > ) / > + * CPUID.15H:EAX[31:0] > + * > + * With CPUID.15H:ECX representing the nominal core crystal clock > + * frequency, and EAX/EBX representing values used to translate the TSC > + * value to that frequency, see "Chapter 20.17 "Time-Stamp Counter" of > + * that manual. > + * > + * Older Intel CPUs, and other vendors do not set CPUID.15H:ECX, and as > + * such we fall back to alternate approaches. > + */ > + pg_cpuid(0x15, r); > + if (r[ECX] > 0) > + { > + /* > + * EBX not being set indicates invariant TSC is not available. > Require > + * EAX being non-zero too, to avoid a theoretical divide by > zero. > + */ > + if (r[EAX] == 0 || r[EBX] == 0) > + return 0; > + > + return r[ECX] / 1000 * r[EBX] / r[EAX]; > + } > + > + /* > + * When CPUID.15H is not available/incomplete, but we have verified an > + * invariant TSC is used, we can instead get the processor base > frequency > + * in MHz from CPUID.16H:EAX, the "Processor Frequency Information > Leaf". > + */ s/can instead/can possibly instead/ As it's also optional... > From 44292c17075264c659e03d784a6c7619e4c5bd3e Mon Sep 17 00:00:00 2001 > From: Lukas Fittl <[email protected]> > Date: Tue, 10 Mar 2026 01:38:14 -0700 > Subject: [PATCH v14 6/6] instrumentation: ARM support for fast time > measurements > > Similar to the RDTSC/RDTSCP instructions on x68-64, this introduces > use of the cntvct_el0 instruction on ARM systems to access the generic > timer that provides a synchronized ticks value across CPUs. > > Note this adds an exception for Apple Silicon CPUs, due to the observed > fact that M3 and newer has different timer frequencies for the Efficiency > and the Performance cores, and we can't be sure where we get scheduled. There are other such SOCs with different clock speeds for performance/efficiency cores, I'm a bit worried that just denylisting Apple Silicon will end up allowing other such systems. Greetings, Andres Freund
