[HACKERS] atomics/arch-x86.h is stupider than atomics/generic-gcc.h?

Tom Lane Tue, 05 Sep 2017 19:31:55 -0700

I spent some time trying to devise a suitable performance microbenchmark
for the atomic ops, in pursuit of whether the proposal at
https://commitfest.postgresql.org/14/1154/
is worth doing.  I came up with the attached very simple-minded test
case, which you run with something like


        create function my_test_atomic_ops(bigint) returns int
        strict volatile language c as '/path/to/atomic-perf-test.so';

        \timing

        select my_test_atomic_ops(1000000000);

The performance of a single process running this is interesting, but
only mildly so: what we want to know about is what happens when you
run two or more calls concurrently.

On my primary server, dual quad-core Xeon E5-2609 @ 2.4GHz, RHEL6
(so gcc version 4.4.7 20120313 (Red Hat 4.4.7-18)), in a disable-cassert
build, I see that a single process running the 1G-iterations case
repeatably takes about 9600ms.  Two competing processes take roughly
1 minute to do twice as much work.  (The two processes tend to finish
at significantly different times, indicating that this box's method
for resolving bus conflicts isn't 100% fair.  I'm taking the average
of the two runtimes as a representative number.)

This is with no source-code changes, meaning that what I'm testing is
arch-x86.h's version of pg_atomic_fetch_add_u32, which compiles to
basically

        lock
        xaddl   %eax,(%rdx)

I then diked out that version, so that the build fell back to
generic-gcc.h's version of the function.  With the test program
as attached, the inner loop is basically the same, and so is the
runtime.  But before that, I had been testing a version that
ignored the result of pg_atomic_fetch_add_u32,

        while (count-- > 0)
        {
                (void) pg_atomic_fetch_add_u32(myptr, 1);
        }

and what I was quite surprised to see was a single-process time of
9600ms and a two-process time of ~40s.  The reason was not too far
to seek: gcc is smart enough to notice that it doesn't need the
result of pg_atomic_fetch_add_u32, and so it compiles that to just

        lock addl       $1, (%rax)

which is evidently significantly more efficient than the xaddl under
contention load.

Or in words of one syllable: at least for pg_atomic_fetch_add_u32,
we are working hard in atomics/arch-x86.h to get worse code than
gcc would give us natively.  (And, in case you didn't notice, this
is far from the latest and shiniest gcc.)

This case is not to be dismissed as insignificant either, since of the
three non-test occurrences of pg_atomic_fetch_add_u32 in our tree, two
ignore the result.

So I think we'd be well advised to cast a doubtful eye at the asm
constructs we've got here, and figure out which ones are really
meaningfully smarter than gcc's primitives.

                        regards, tom lane

#include "postgres.h"

#include "fmgr.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"


PG_MODULE_MAGIC;

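/* Counter in shared memory that all backends running the test contend on */
static pg_atomic_uint32 *globptr = NULL;

/* Global sink, so the compiler cannot discard the per-call "junk" sum */
int32		globjunk = 0;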

PG_FUNCTION_INFO_V1(my_test_atomic_ops);

Datum
my_test_atomic_ops(PG_FUNCTION_ARGS)
{
	int64		count = PG_GETARG_INT64(0);
	int32		result;
	pg_atomic_uint32 *myptr;
	int32		junk = 0;

	if (globptr == NULL)
	{
		/* First time through in this process; find shared memory */
		bool		found;

		LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);

		globptr = ShmemInitStruct("my_test_atomic_ops",
								sizeof(*globptr),
								&found);

		if (!found)
		{
			/* First time through anywhere */
			pg_atomic_init_u32(globptr, 0);
		}

		LWLockRelease(AddinShmemInitLock);
	}

	myptr = globptr;

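	/*
	 * Accumulate the fetched values, so that gcc cannot notice the result
	 * of pg_atomic_fetch_add_u32 is unused and optimize it away.
	 */
	while (count-- > 0)
	{
		junk += pg_atomic_fetch_add_u32(myptr, 1);
	}

	/* Store the sum globally so the loop's result counts as used */
	globjunk += junk;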

	result = pg_atomic_read_u32(myptr);

	PG_RETURN_INT32(result);
}

