Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:
Thanks for continuing to work on this patch, and my apologies for silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.


Its been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities.
That's good and bad. Bad to do the development and good to test the implementation on more virtualized setups; given that I also encountered "interesting" behavior on VMWare (see my previous mails).

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav...@gmail.com> wrote:

    The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
    variants
    return double. This seems error prone. What about renaming the
    function
    or also have the function return a double and cast where necessary at
    the call site?


Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds we want the extra precision while microseconds are precise enough. Still, we could also make the seconds and milliseconds conversion code integer only and e.g. return two integers with the value before and after the comma. FWICS, the functions are nowhere used in performance critical code, so it doesn't really make a difference performance-wise.


+1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch.
Moved to the current commit fest. Will you become reviewer?

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1] https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com

--
David Geier
(ServiceNow)
From f63527c8e4b3b0b71ffacaa111dd93325d973432 Mon Sep 17 00:00:00 2001
From: David Geier <geidav...@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
 cheaper.

---
 src/include/portability/instr_time.h | 62 ++++++++++++----------------
 1 file changed, 26 insertions(+), 36 deletions(-)

diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
 
-#define INSTR_TIME_IS_ZERO(t)	((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
 
-#define INSTR_TIME_SET_ZERO(t)	((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t)	((t) == 0)
 
-#define INSTR_TIME_SET_CURRENT(t)	((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t)	((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+	struct timespec tmp;
+
+	clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+	(t) = pg_clock_gettime_ns()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
-		(x).tv_sec += (y).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y); \
 	} while (0)
 
 #define INSTR_TIME_SUBTRACT(x,y) \
 	do { \
-		(x).tv_sec -= (y).tv_sec; \
-		(x).tv_nsec -= (y).tv_nsec; \
-		/* Normalize */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
+		(x) -= (y); \
 	} while (0)
 
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	do { \
-		(x).tv_sec += (y).tv_sec - (z).tv_sec; \
-		(x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
-		/* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
-		while ((x).tv_nsec < 0) \
-		{ \
-			(x).tv_nsec += 1000000000; \
-			(x).tv_sec--; \
-		} \
-		while ((x).tv_nsec >= 1000000000) \
-		{ \
-			(x).tv_nsec -= 1000000000; \
-			(x).tv_sec++; \
-		} \
+		(x) += (y) - (z); \
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	(((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+	((double) (t) / NS_PER_S)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	(((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+	((double) (t) / NS_PER_MS)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	(((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+	((double) (t) / NS_PER_US)
 
 #else							/* WIN32 */
 
-- 
2.34.1

From 3a268a74696681f2c6f666fd3ecbb020bca53599 Mon Sep 17 00:00:00 2001
From: David Geier <geidav...@gmail.com>
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for
 instrumentation.

For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.

Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
 src/backend/utils/init/postinit.c       |   3 +
 src/bin/pg_test_timing/pg_test_timing.c |   1 +
 src/bin/pgbench/pgbench.c               |   3 +
 src/bin/psql/startup.c                  |   4 +
 src/common/Makefile                     |   1 +
 src/common/instr_time.c                 | 120 ++++++++++++++++++++++++
 src/include/portability/instr_time.h    |  52 ++++++++--
 src/tools/msvc/Mkvcbuild.pm             |   2 +-
 8 files changed, 175 insertions(+), 11 deletions(-)
 create mode 100644 src/common/instr_time.c

diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..ec0661accc 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
 	/* Initialize portal manager */
 	EnablePortalManager();
 
+	/* initialize high-precision interval timing */
+	INSTR_TIME_INITIALIZE();
+
 	/* Initialize status reporting */
 	pgstat_beinit();
 
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timing(unsigned int duration)
 
 	total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
 
+	INSTR_TIME_INITIALIZE();
 	INSTR_TIME_SET_CURRENT(start_time);
 	cur = INSTR_TIME_GET_MICROSEC(start_time);
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 820b1ea09f..2f0a306f7e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7076,6 +7076,9 @@ main(int argc, char **argv)
 		initRandomState(&state[i].cs_func_rs);
 	}
 
+	/* initialize high-precision interval timing */
+	INSTR_TIME_INITIALIZE();
+
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 5a28b6f713..3f95e01320 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
 #include "help.h"
 #include "input.h"
 #include "mainloop.h"
+#include "portability/instr_time.h"
 #include "settings.h"
 
 /*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
 
 	PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
 
+	/* initialize high-precision interval timing */
+	INSTR_TIME_INITIALIZE();
+
 	SyncVariables();
 
 	if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 898701fae1..fca4b546c8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
 	file_perm.o \
 	file_utils.o \
 	hashfn.o \
+	instr_time.o \
 	ip.o \
 	jsonapi.o \
 	keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..5f85350d1b
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ *	   Non-inline parts of the portable high-precision interval timing
+ *	 implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up with
+ * seconds. This indirection exists to support the rtdsc instruction.
+ *
+ * As a default, assume we are using clock_gettime() as a fallback and treat it
+ * as 1 "cycle" per nanosecond (aka 1 GHz).
+ *
+ * When using the RDTSC instruction directly this is filled in during
+ * initialization based on the relevant cpuid fields.
+ */
+double cycles_to_sec = 1.0 / NS_PER_S;
+double cycles_to_ms = 1.0 / MS_PER_S;
+double cycles_to_us = 1.0 / US_PER_S;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Determines whether rdtsc is used (Linux/x86 only, when OS uses tsc clocksource)
+ */
+bool use_rdtsc = false;
+
+static bool under_hypervisor(void)
+{
+	uint32 r0, r1, r2, r3;
+	int result = __get_cpuid(0x1, &r0, &r1, &r2, &r3);
+	return result > 0 && ((r2 >> 31) & 1) == 1;
+}
+
+static bool get_hypervisor_tsc_frequency_khz(uint32 *tsc_freq)
+{
+	uint32 r0, r1, r2;
+
+	if (!under_hypervisor())
+		return false;
+
+	/*
+	* Use __cpuid() because __get_cpuid() bails out for leafs > 0x80000000.
+	* Likely, it's not accounting for recently introduced VM leafs.
+	* Unlike leaf 0x16, the hypervisor timing leafs return the frequency already in KHz.
+	*/
+	__cpuid(0x40000010, *tsc_freq, r0, r1, r2);
+	return *tsc_freq > 0;
+}
+
+static bool get_host_tsc_frequency_khz(uint32 *tsc_freq)
+{
+	uint32 r0, r1, r2;
+	int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+	*tsc_freq *= 1000; // Convert from MHz to KHz
+	return result > 0 && *tsc_freq > 0;
+}
+
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+	if (get_hypervisor_tsc_frequency_khz(tsc_freq))
+		return true;
+	if (get_host_tsc_frequency_khz(tsc_freq))
+		return true;
+
+	return false;
+}
+
+/*
+ * Decide whether we use the rdtsc instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_clock_gettime_initialize_rdtsc(void)
+{
+	FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+	if (fp)
+	{
+		char buf[128];
+
+		if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+		{
+			/*
+			* Compute baseline CPU peformance, determines speed at which RDTSC advances.
+			*/
+			uint32 tsc_freq;
+
+			if (get_tsc_frequency_khz(&tsc_freq))
+			{
+				cycles_to_sec = 1.0 / (tsc_freq * 1000);
+				cycles_to_ms = 1.0 / tsc_freq;
+				cycles_to_us = 1000.0 / tsc_freq;
+				use_rdtsc = true;
+			}
+		}
+
+		fclose(fp);
+	}
+}
+#endif							/* defined(__x86_64__) && defined(__linux__) */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index c64f21eb53..9c004fa56e 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
  *	  portable high-precision interval timing
  *
  * This file provides an abstraction layer to hide portability issues in
- * interval timing.  On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter().  These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS.  Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
  *
  * The basic data type is instr_time, which all callers should treat as an
  * opaque typedef.  instr_time can store either an absolute time (of
@@ -56,10 +58,15 @@
 
 #ifndef WIN32
 
-/* Use clock_gettime() */
+/* Uses rdtsc on Linux/x86 if available, otherwise clock_gettime() */
 
 #include <time.h>
 
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+#endif
+
 /*
  * The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
  * since that will give reliable interval timing even in the face of changes
@@ -80,7 +87,9 @@
 #define PG_INSTR_CLOCK	CLOCK_REALTIME
 #endif
 
+/* time in cpu reference cycles (when using rdtsc), otherwise nanoseconds */
 typedef int64 instr_time;
+
 #define NS_PER_S INT64CONST(1000000000)
 #define US_PER_S INT64CONST(1000000)
 #define MS_PER_S INT64CONST(1000)
@@ -92,17 +101,38 @@ typedef int64 instr_time;
 
 #define INSTR_TIME_SET_ZERO(t)	((t) = 0)
 
-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+extern double cycles_to_ms;
+extern double cycles_to_us;
+
+#if defined(__x86_64__) && defined(__linux__)
+extern bool use_rdtsc;
+
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
 {
 	struct timespec tmp;
 
+#if defined(__x86_64__) && defined(__linux__)
+	if (use_rdtsc)
+		return __rdtsc();
+#endif
+    
 	clock_gettime(PG_INSTR_CLOCK, &tmp);
-
 	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
 }
 
+#if defined(__x86_64__) && defined(__linux__)
+#define INSTR_TIME_INITIALIZE() \
+       pg_clock_gettime_initialize_rdtsc()
+#else
+#define INSTR_TIME_INITIALIZE()
+#endif
+
 #define INSTR_TIME_SET_CURRENT(t) \
-	(t) = pg_clock_gettime_ns()
+	(t) = pg_clock_gettime_ref_cycles()
 
 #define INSTR_TIME_ADD(x,y) \
 	do { \
@@ -120,13 +150,13 @@ static inline instr_time pg_clock_gettime_ns(void)
 	} while (0)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
-	((double) (t) / NS_PER_S)
+	((double) (t) * cycles_to_sec)
 
 #define INSTR_TIME_GET_MILLISEC(t) \
-	((double) (t) / NS_PER_MS)
+	((double) (t) * cycles_to_ms)
 
 #define INSTR_TIME_GET_MICROSEC(t) \
-	((double) (t) / NS_PER_US)
+	((uint64) (t) * cycles_to_us)
 
 #else							/* WIN32 */
 
@@ -138,6 +168,8 @@ typedef LARGE_INTEGER instr_time;
 
 #define INSTR_TIME_SET_ZERO(t)	((t).QuadPart = 0)
 
+#define INSTR_TIME_INITIALIZE()
+
 #define INSTR_TIME_SET_CURRENT(t)	QueryPerformanceCounter(&(t))
 
 #define INSTR_TIME_ADD(x,y) \
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index afa40d885e..40910991e3 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
 	our @pgcommonallfiles = qw(
 	  archive.c base64.c checksum_helper.c compression.c
 	  config_info.c controldata_utils.c d2s.c encnames.c exec.c
-	  f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+	  f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
 	  keywords.c kwlookup.c link-canary.c md5_common.c
 	  pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
 	  rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
-- 
2.34.1

From 4d6ab38c2de7d0d7ec22a4d016e53b4d82828040 Mon Sep 17 00:00:00 2001
From: David Geier <geidav...@gmail.com>
Date: Mon, 2 Jan 2023 13:32:58 +0100
Subject: [PATCH 3/3] Refactor some instr_time related code

---
 src/backend/access/heap/vacuumlazy.c | 3 +--
 src/backend/access/transam/xlog.c    | 4 ++--
 src/backend/storage/buffer/bufmgr.c  | 4 ++--
 src/bin/psql/common.c                | 8 ++++----
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 06fd15405f..c6c69b5547 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2994,8 +2994,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
 			INSTR_TIME_SET_CURRENT(currenttime);
 			elapsed = currenttime;
 			INSTR_TIME_SUBTRACT(elapsed, starttime);
-			if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
-				>= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+			if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
 			{
 				if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
 				{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0070d56b0b..240e5afe35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2177,7 +2177,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
-			instr_time	start;
+			instr_time	start = 0;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -8129,7 +8129,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	start = 0;
 
 	Assert(tli != 0);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..96f6735398 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2824,8 +2824,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start = 0;
+	instr_time	io_time = 0;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 38f9b10b7c..5c835c43ad 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1269,8 +1269,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
 	bool		timing = pset.timing;
 	PGresult   *result;
 	bool		OK;
-	instr_time	before,
-				after;
+	instr_time	before = 0;
+	instr_time	after = 0;
 
 	*elapsed_msec = 0;
 
@@ -1689,8 +1689,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
 	int			ntuples;
 	int			fetch_count;
 	char		fetch_cmd[64];
-	instr_time	before,
-				after;
+	instr_time	before = 0;
+	instr_time	after = 0;
 	int			flush_error;
 
 	*elapsed_msec = 0;
-- 
2.34.1

Reply via email to