I spent some time trying to devise a suitable performance microbenchmark for the atomic ops, in pursuit of whether the proposal at https://commitfest.postgresql.org/14/1154/ is worth doing. I came up with the attached very simple-minded test case, which you run with something like

	create function my_test_atomic_ops(bigint) returns int
	strict volatile language c as '/path/to/atomic-perf-test.so';

	\timing

	select my_test_atomic_ops(1000000000);

The performance of a single process running this is interesting, but only mildly so: what we want to know about is what happens when you run two or more calls concurrently.

On my primary server, dual quad-core Xeon E5-2609 @ 2.4GHz, RHEL6 (so gcc version 4.4.7 20120313 (Red Hat 4.4.7-18)), in a disable-cassert build, I see that a single process running the 1G-iterations case repeatably takes about 9600ms. Two competing processes take roughly 1 minute to do twice as much work. (The two processes tend to finish at significantly different times, indicating that this box's method for resolving bus conflicts isn't 100% fair. I'm taking the average of the two runtimes as a representative number.)

This is with no source-code changes, meaning that what I'm testing is arch-x86.h's version of pg_atomic_fetch_add_u32, which compiles to basically

	lock
	xaddl	%eax,(%rdx)

I then diked out that version, so that the build fell back to generic-gcc.h's version of the function. With the test program as attached, the inner loop is basically the same, and so is the runtime. But what I was testing before that was a version that ignored the result of pg_atomic_fetch_add_u32,

	while (count-- > 0)
	{
		(void) pg_atomic_fetch_add_u32(myptr, 1);
	}

and what I was quite surprised to see was a single-thread time of 9600ms and a two-thread time of ~40s. The reason was not too far to seek: gcc is smart enough to notice that it doesn't need the result of pg_atomic_fetch_add_u32, and so it compiles that to just

	lock
	addl	$1, (%rax)

which is evidently significantly more efficient than the xaddl under contention load.

Or in words of one syllable: at least for pg_atomic_fetch_add_u32, we are working hard in atomics/arch-x86.h to get worse code than gcc would give us natively.
(And, in case you didn't notice, this is far from the latest and shiniest gcc.)

This case is not to be dismissed as insignificant either, since of the three non-test occurrences of pg_atomic_fetch_add_u32 in our tree, two ignore the result.

So I think we'd be well advised to cast a doubtful eye at the asm constructs we've got here, and figure out which ones are really meaningfully smarter than gcc's primitives.

			regards, tom lane

#include "postgres.h"

#include "fmgr.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

static pg_atomic_uint32 *globptr = NULL;

int32		globjunk = 0;

PG_FUNCTION_INFO_V1(my_test_atomic_ops);

Datum
my_test_atomic_ops(PG_FUNCTION_ARGS)
{
	int64		count = PG_GETARG_INT64(0);
	int32		result;
	pg_atomic_uint32 *myptr;
	int32		junk = 0;

	if (globptr == NULL)
	{
		/* First time through in this process; find shared memory */
		bool		found;

		LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);

		globptr = ShmemInitStruct("my_test_atomic_ops",
								  sizeof(*globptr),
								  &found);

		if (!found)
		{
			/* First time through anywhere */
			pg_atomic_init_u32(globptr, 0);
		}

		LWLockRelease(AddinShmemInitLock);
	}

	myptr = globptr;

	while (count-- > 0)
	{
		junk += pg_atomic_fetch_add_u32(myptr, 1);
	}

	globjunk += junk;

	result = pg_atomic_read_u32(myptr);

	PG_RETURN_INT32(result);
}
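
For anyone who wants to look at the code-generation difference without a PostgreSQL build, the two loop shapes discussed in the message can be sketched as a standalone file. This is only an illustration under assumptions: the function names bump_discard and bump_use are hypothetical, a plain uint32_t stands in for pg_atomic_uint32, and gcc's __sync_fetch_and_add builtin stands in for whatever the generic fallback in your tree actually uses. It does not measure contention; it only isolates the "result used" versus "result ignored" cases.

```c
#include <stdint.h>

/*
 * Result ignored: the compiler does not need the old value, so on x86-64
 * it is free to pick a plain locked add instead of xadd.
 */
void
bump_discard(volatile uint32_t *ptr, int64_t count)
{
	while (count-- > 0)
		(void) __sync_fetch_and_add(ptr, 1);
}

/*
 * Result consumed: the old value is accumulated and returned, so the
 * compiler has to fetch it on every iteration.
 */
uint32_t
bump_use(volatile uint32_t *ptr, int64_t count)
{
	uint32_t	junk = 0;

	while (count-- > 0)
		junk += __sync_fetch_and_add(ptr, 1);

	return junk;
}
```

Compiling this with something like "gcc -O2 -S" and comparing the two loops in the resulting assembly should show whether your compiler makes the same choice reported above; the exact instructions will depend on the compiler version and target.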