> 2) The spinlocks on both the sending and receiving side are quite hot:
>    rafia query leader:
> +   36.16%  postgres  postgres            [.] shm_mq_receive
> +   19.49%  postgres  postgres            [.] s_lock
> +   13.24%  postgres  postgres            [.] SetLatch

Here's a patch which, as per an off-list discussion between Andres,
Amit, and myself, removes the use of the spinlock for most
send/receive operations in favor of memory barriers and the atomics
support for 8-byte reads and writes.  I tested with a pgbench -i -s
300 database with pgbench_accounts_pkey dropped and
max_parallel_workers_per_gather boosted to 10.  I used this query:

select aid, count(*) from pgbench_accounts group by 1 having count(*) > 1;

which produces this plan:

 Finalize GroupAggregate  (cost=1235865.51..5569468.75 rows=10000000 width=12)
   Group Key: aid
   Filter: (count(*) > 1)
   ->  Gather Merge  (cost=1235865.51..4969468.75 rows=30000000 width=12)
         Workers Planned: 6
         ->  Partial GroupAggregate  (cost=1234865.42..1322365.42 rows=5000000 width=12)
               Group Key: aid
               ->  Sort  (cost=1234865.42..1247365.42 rows=5000000 width=4)
                     Sort Key: aid
                     ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..541804.00 rows=5000000 width=4)
(10 rows)
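
For anyone who wants a picture of the general technique before reading the
patch, here is a minimal sketch of a single-sender, single-receiver byte
queue that uses two 8-byte atomic counters and acquire/release ordering
instead of a spinlock.  This is not the patch: it uses C11 <stdatomic.h>
rather than PostgreSQL's atomics and barrier primitives, and demo_queue,
demo_send, demo_receive, and RING_SIZE are hypothetical names invented for
illustration.  It also leaves out latches, detach handling, and message
framing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 16            /* hypothetical ring size, in bytes */

typedef struct
{
    _Atomic uint64_t bytes_written; /* advanced only by the sender */
    _Atomic uint64_t bytes_read;    /* advanced only by the receiver */
    char        ring[RING_SIZE];
} demo_queue;

static void
demo_queue_init(demo_queue *q)
{
    atomic_init(&q->bytes_written, 0);
    atomic_init(&q->bytes_read, 0);
    memset(q->ring, 0, sizeof(q->ring));
}

/* Sender side: queue up to len bytes, return how many were queued. */
static size_t
demo_send(demo_queue *q, const char *data, size_t len)
{
    /* Our own counter: nobody else writes it, so relaxed is enough. */
    uint64_t    written = atomic_load_explicit(&q->bytes_written,
                                               memory_order_relaxed);
    /* Acquire: the receiver is done with any space we are about to reuse. */
    uint64_t    read = atomic_load_explicit(&q->bytes_read,
                                            memory_order_acquire);
    size_t      used = (size_t) (written - read);
    size_t      n;

    assert(used <= RING_SIZE);
    n = (len < RING_SIZE - used) ? len : RING_SIZE - used;
    for (size_t i = 0; i < n; i++)
        q->ring[(written + i) % RING_SIZE] = data[i];

    /* Release: the data must be visible before the new counter value. */
    atomic_store_explicit(&q->bytes_written, written + n,
                          memory_order_release);
    return n;
}

/* Receiver side: copy out up to maxlen bytes, return how many were read. */
static size_t
demo_receive(demo_queue *q, char *out, size_t maxlen)
{
    uint64_t    read = atomic_load_explicit(&q->bytes_read,
                                            memory_order_relaxed);
    /* Acquire: pairs with the sender's release store above. */
    uint64_t    written = atomic_load_explicit(&q->bytes_written,
                                               memory_order_acquire);
    size_t      avail = (size_t) (written - read);
    size_t      n = (maxlen < avail) ? maxlen : avail;

    for (size_t i = 0; i < n; i++)
        out[i] = q->ring[(read + i) % RING_SIZE];

    /* Release: we are done reading before the space is handed back. */
    atomic_store_explicit(&q->bytes_read, read + n, memory_order_release);
    return n;
}
```

The point of the sketch is that each counter has exactly one writer, so
neither side needs a lock to update it; the only requirement is that the
data copy and the counter update become visible in the right order.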

On hydra (PPC), these changes didn't help much.  Timings (ms):

master: 29605.582, 29753.417, 30160.485
patch: 28218.396, 27986.951, 26465.584

Median-to-median, that's about a 6% improvement.  On my MacBook, though,
the improvement was quite a bit larger:

master: 21436.745, 20978.355, 19918.617
patch: 15896.573, 15880.652, 15967.176

Median-to-median, that's about a 24% improvement.
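
In case anyone wants to redo the arithmetic, the median-to-median figure is
just (median(master) - median(patch)) / median(master).  A small sketch for
three runs per side follows; median3 and improvement_pct are hypothetical
helper names, not anything from the patch.

```c
#include <assert.h>

/* Median of exactly three values. */
static double
median3(double a, double b, double c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a))
        return b;
    if ((b <= a && a <= c) || (c <= a && a <= b))
        return a;
    return c;
}

/* Percentage by which the patched median is lower than the master median. */
static double
improvement_pct(const double master[3], const double patch[3])
{
    double      m = median3(master[0], master[1], master[2]);
    double      p = median3(patch[0], patch[1], patch[2]);

    assert(m > 0.0);
    return 100.0 * (m - p) / m;
}
```

Fed the timings above, this gives roughly 5.9% for hydra and 24.2% for the
MacBook.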

Any reviews appreciated.


