Hi,

On 2023-05-17 12:53:15 -0700, Andres Freund wrote:
> I'll try to come up with a benchmark without the issues I pointed out in
> https://postgr.es/m/20230517194331.ficfy5brpfq5lrmz%40awork3.anarazel.de
Here we go:

setup:

create primary
SELECT pg_create_physical_replication_slot('reserve', true);
create standby using pg_basebackup

create WAL:
psql -c 'CREATE TABLE testtable_logged(other_data int default 1);' && \
c=16; PGOPTIONS='-c synchronous_commit=off' /path-to-pgbench --random-seed=0 -n -c$c -j$c -t1000 -P1 -f <( echo "INSERT INTO testtable_logged SELECT generate_series(1, 1000)" ) && \
psql -c "SELECT pg_create_restore_point('end');"

benchmark:
rm -rf /tmp/test && \
cp -ar /srv/dev/pgdev-dev-standby /tmp/test && \
cp -ar /srv/dev/pgdev-dev/pg_wal/* /tmp/test/pg_wal/ && \
sync && \
/usr/bin/time -f '%es' /path-to-postgres -D /tmp/test -c recovery_target_action=shutdown -c recovery_target_name=end -c shared_buffers=1GB -c fsync=off

That way I can measure how long it takes to replay exactly the same WAL, and
also take profiles of exactly the same work, without influencing the timing
results. I copy the WAL files from the primary into the standby's pg_wal to
ensure that walreceiver (standby) / walsender (primary) performance doesn't
add to the result variability.

                       max_wal_senders=10   max_wal_senders=100
e101dfac3a5 reverted   7.01s                7.02s
093e5c57d50 / HEAD     8.25s                19.91s
bharat-v3              7.14s                7.13s

So indeed, bharat-v3 largely fixes the issue. The regression of v3 compared
to e101dfac3a5 reverted seems pretty constant at ~0.982x (7.01s / 7.14s),
independent of the concrete max_wal_senders value. Which makes sense - the
wakeup work is constant, rather than scaling with max_wal_senders.

To make it more extreme, I also tested a workload that is basically free to
replay:

c=16; /srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -c$c -j$c -t5000 -P1 -f <( echo "SELECT pg_logical_emit_message(false, 'c', 'a') FROM generate_series(1, 1000)" ) && \
psql -c "SELECT pg_create_restore_point('end');"

                       max_wal_senders=10   max_wal_senders=100
e101dfac3a5 reverted   1.70s                1.70s
093e5c57d50 / HEAD     3.00s                14.56s
bharat-v3              1.88s                1.88s

In this extreme workload we still regress by ~0.904x (1.70s / 1.88s). I'm not
sure how much it's worth worrying about that - this is a quite unrealistic
testcase.

FWIW, if I just make WalSndWakeup() do nothing, I still see a very small, but
reproducible, overhead: 1.72s - that's just the cost of the additional
external function call. If I add a no-waiters fastpath using
proclist_is_empty() to ConditionVariableBroadcast(), I get 1.77s. So the
majority of the remaining slowdown indeed comes from the spinlock acquisition
in ConditionVariableBroadcast().

Greetings,

Andres Freund
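
PS: For anyone who hasn't looked at the patch, the gist of the
condition-variable approach is roughly the following sketch - the field and
parameter names here are mine, for illustration only, and the actual patch
may differ:

/*
 * Sketch of the shape of the fix, assuming headers like
 * storage/condition_variable.h and replication/walsender_private.h.
 */
typedef struct WalSndCtlData
{
	/* ... existing fields ... */

	/* physical walsenders sleep on this until more WAL has been flushed */
	ConditionVariable wal_flush_cv;

	/* logical walsenders sleep on this until more WAL has been replayed */
	ConditionVariable wal_replay_cv;

	WalSnd		walsnds[FLEXIBLE_ARRAY_MEMBER];
} WalSndCtlData;

/*
 * Wake up walsenders after WAL has been flushed / replayed.  Crucially,
 * this no longer iterates over all max_wal_senders slots taking each
 * slot's spinlock - the cost is at most two broadcasts, independent of
 * max_wal_senders.
 */
void
WalSndWakeup(bool physical, bool logical)
{
	if (physical)
		ConditionVariableBroadcast(&WalSndCtl->wal_flush_cv);

	if (logical)
		ConditionVariableBroadcast(&WalSndCtl->wal_replay_cv);
}

That's why the bharat-v3 numbers above are flat across max_wal_senders
values, while HEAD degrades from 8.25s to 19.91s.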
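
And the no-waiters fastpath experiment mentioned above is essentially this
(a sketch against the shape of ConditionVariableBroadcast() in
src/backend/storage/lmgr/condition_variable.c, not the exact diff I tested):

/*
 * Note that checking the list without holding cv->mutex can race with a
 * process concurrently adding itself in ConditionVariablePrepareToSleep(),
 * so this is a measurement hack to isolate the spinlock cost, not a
 * proposed change.
 */
void
ConditionVariableBroadcast(ConditionVariable *cv)
{
	/* new: bail out before acquiring cv->mutex if nobody appears to wait */
	if (proclist_is_empty(&cv->wakeup))
		return;

	/*
	 * ... existing implementation: SpinLockAcquire(&cv->mutex), wake and
	 * remove each waiter from cv->wakeup, SpinLockRelease(&cv->mutex) ...
	 */
}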