On Fri, Apr 18, 2025 at 7:48 PM Bertrand Drouvot
<bertranddrouvot...@gmail.com> wrote:
>
> Hi,
>
> On Thu, Apr 17, 2025 at 01:58:44AM +1200, Thomas Munro wrote:
> > On Wed, Apr 16, 2025 at 9:14 PM Jakub Wartak
> > <jakub.war...@enterprisedb.com> wrote:
> > > 2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
> > > expert on DSA/DSM at all)
> >
> > I have no answers but I have speculated for years about a very
> > specific case (without any idea where to begin due to lack of ... I
> > guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> > workers split up and try to work on different batches on their own to
> > minimise contention, and when that's not possible (more workers than
> > batches, or finishing their existing work at different times and going
> > to help others), they just proceed in round-robin order. A beginner
> > thought is: if you're going to help someone working on a hash table,
> > it would surely be best to have the CPUs and all the data on the same
> > NUMA node. During loading, cache line ping pong would be cheaper, and
> > during probing, it *might* be easier to tune explicit memory prefetch
> > timing that way as it would look more like a single node system with a
> > fixed latency, IDK (I've shared patches for prefetching before that
> > showed pretty decent speedups, and the lack of that feature is
> > probably a bigger problem than any of this stuff, who knows...).
> > Another beginner thought is that the DSA allocator is a source of
> > contention during loading: the dumbest problem is that the chunks are
> > just too small, but it might also be interesting to look into per-node
> > pools. Or something. IDK, just some thoughts...
>
> I'm also thinking that could be beneficial for parallel workers. I think the
> ideal scenario would be to have the parallel workers spread across numa nodes
> and accessing their "local" memory first (and help with "remote" memory
> access if there is still more work to do "remotely").
Hi Bertrand,

I've played with CPU pinning of PQ workers (by adjusting the postmaster's CPU
affinity, which new workers inherit), but I got quite the opposite results -
please see attached, especially the "lat"ency figures depending on how the
CPUs were assigned relative to the NUMA node holding s_b when it was not
interleaved. Not that I intend to spend a lot of time researching PQ vs NUMA,
but I've included interleaving of the PQ shm segments too in the v4 patch in
the subthread nearby. The results attached here were made some time ago with
v1 of the patch, where the PQ shm segment was not interleaved.

If anything, it would be interesting to hear whether there are any sensible
production-like scenarios/workloads where dynamic_shared_memory_type should
be set to sysv or mmap (instead of the default posix)? Asking for Linux only;
I couldn't imagine anything (?)

-J.
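PS. For completeness, the kind of per-node pinning used for the runs below
can be expressed with libnuma roughly as in the sketch that follows (an
illustration of the idea only, not part of the patch and not necessarily how
these runs were pinned); affinity set on the postmaster is inherited by PQ
workers forked later:

/*
 * sketch.c: restrict the current process to the CPUs of one NUMA node.
 * Anything forked afterwards (e.g. PQ workers, if this were done in the
 * postmaster) inherits the CPU affinity.  Build with: cc sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    int         node = (argc > 1) ? atoi(argv[1]) : 0;

    if (numa_available() < 0)
    {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* run only on the CPUs belonging to "node" */
    if (numa_run_on_node(node) != 0)
    {
        perror("numa_run_on_node");
        return 1;
    }

    /* and prefer allocating memory from that node as well */
    numa_set_preferred(node);

    printf("now restricted to CPUs of NUMA node %d\n", node);
    /* ... exec() the real workload here ... */
    return 0;
}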
On a legacy 4-socket NUMA box, the simplest query, `select sum(aid) from
pgbench_accounts_1` (a single partition, 400MB), run with pgbench -c 1 and a
small s_b of 4GB just taking a single NUMA node, gets this:

Finalize Aggregate  (cost=72150.70..72150.71 rows=1 width=8) (actual time=252.817..254.409 rows=1.00 loops=1)
  Output: sum(aid)
  Buffers: shared hit=54798
  ->  Gather  (cost=72150.49..72150.69 rows=2 width=8) (actual time=252.319..254.381 rows=3.00 loops=1)
        Output: (PARTIAL sum(aid))
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=54798
        ->  Partial Aggregate  (cost=71150.49..71150.49 rows=1 width=8) (actual time=247.262..247.264 rows=1.00 loops=3)
              Output: PARTIAL sum(aid)
              Buffers: shared hit=54798
              Worker 0:  actual time=245.116..245.118 rows=1.00 loops=1
                Buffers: shared hit=19185
              Worker 1:  actual time=244.797..244.800 rows=1.00 loops=1
                Buffers: shared hit=19021
              ->  Parallel Seq Scan on public.pgbench_accounts_1  (cost=0.00..67879.99 rows=1308199 width=4) (actual time=0.011..136.895 rows=1041666.67 loops=3)
                    Output: aid, bid, abalance, filler
                    Buffers: shared hit=54798
                    Worker 0:  actual time=0.010..133.393 rows=1094053.00 loops=1
                      Buffers: shared hit=19185
                    Worker 1:  actual time=0.009..134.281 rows=1084465.00 loops=1
                      Buffers: shared hit=19021

Observe the "lat"ency metric; it primarily depends on where the processes are placed:

no PQ, default behaviour:
progress: 57.0 s, 2.0 tps, lat 515.312 ms stddev 0.053, 0 failed
progress: 58.0 s, 2.0 tps, lat 609.009 ms stddev 92.653, 0 failed
progress: 59.0 s, 2.0 tps, lat 515.414 ms stddev 0.423, 0 failed
progress: 60.0 s, 2.0 tps, lat 515.530 ms stddev 0.138, 0 failed
progress: 61.0 s, 2.0 tps, lat 515.370 ms stddev 0.181, 0 failed
progress: 62.0 s, 2.0 tps, lat 515.396 ms stddev 0.351, 0 failed
progress: 63.0 s, 1.0 tps, lat 515.484 ms stddev NaN, 0 failed
progress: 64.0 s, 2.0 tps, lat 515.428 ms stddev 0.035, 0 failed
progress: 65.0 s, 2.0 tps, lat 515.357 ms stddev 0.089, 0 failed
progress: 66.0 s, 2.0 tps, lat 608.885 ms stddev 92.633, 0 failed
progress: 67.0 s, 2.0 tps, lat 515.828 ms stddev 0.501, 0 failed
progress: 68.0 s, 1.0 tps, lat 603.824 ms stddev NaN, 0 failed

no PQ, manual pinning (above, it fluctuates a lot due to migrations):
progress: 318.0 s, 3.0 tps, lat 321.714 ms stddev 0.089, 0 failed  <-- bind backend to CPU on the same NUMA node, gets 1.4GB/s DRAM read on that whole socket/OS
progress: 319.0 s, 3.0 tps, lat 321.630 ms stddev 0.254, 0 failed
progress: 320.0 s, 3.0 tps, lat 321.655 ms stddev 0.131, 0 failed
progress: 321.0 s, 3.0 tps, lat 321.555 ms stddev 0.132, 0 failed
progress: 322.0 s, 3.0 tps, lat 322.027 ms stddev 0.964, 0 failed
progress: 323.0 s, 2.0 tps, lat 426.245 ms stddev 49.599, 0 failed
progress: 324.0 s, 2.0 tps, lat 519.873 ms stddev 0.160, 0 failed  <-- bind query to CPU on a different NUMA node (hits interconnect), gets only ~0.8GB/s read on the socket with s_b
progress: 325.0 s, 2.0 tps, lat 519.854 ms stddev 0.488, 0 failed
progress: 326.0 s, 2.0 tps, lat 519.604 ms stddev 0.285, 0 failed
progress: 327.0 s, 2.0 tps, lat 519.626 ms stddev 0.190, 0 failed
progress: 328.0 s, 2.0 tps, lat 520.172 ms stddev 0.064, 0 failed
progress: 329.0 s, 3.0 tps, lat 409.836 ms stddev 41.466, 0 failed
progress: 330.0 s, 3.0 tps, lat 325.236 ms stddev 0.248, 0 failed  <-- bind backend to CPU on the same NUMA node again
progress: 331.0 s, 3.0 tps, lat 325.190 ms stddev 0.261, 0 failed
progress: 332.0 s, 3.0 tps, lat 325.485 ms stddev 0.751, 0 failed
progress: 333.0 s, 3.0 tps, lat 325.026 ms stddev 0.065, 0 failed

520/325 ms = 1.6x; IPC jumps
from 1.23 to ~2.00 just due to the switch from remote to local DRAM memory access (latency).

no PQ and numa=on (s_b interleaved), STABLE (low stddev); note the optimal case
was ~321ms while interleaved gets ~463ms. Yes, it is worse (321->463ms), but we
get uniform latency instead of high variance depending on how the scheduler
assigns CPUs:
progress: 10.0 s, 2.0 tps, lat 463.378 ms stddev 0.181, 0 failed
progress: 11.0 s, 2.0 tps, lat 463.422 ms stddev 0.016, 0 failed
progress: 12.0 s, 2.0 tps, lat 463.121 ms stddev 0.052, 0 failed
progress: 13.0 s, 2.0 tps, lat 463.198 ms stddev 0.148, 0 failed
progress: 14.0 s, 3.0 tps, lat 463.823 ms stddev 0.854, 0 failed
progress: 15.0 s, 2.0 tps, lat 462.937 ms stddev 0.199, 0 failed

OK, let's enable PQ: 2 workers and leader participating, default:
progress: 30.0 s, 7.0 tps, lat 134.692 ms stddev 4.644, 0 failed
progress: 31.0 s, 8.0 tps, lat 129.489 ms stddev 1.510, 0 failed
progress: 32.0 s, 7.0 tps, lat 130.506 ms stddev 3.393, 0 failed
progress: 33.0 s, 8.0 tps, lat 133.785 ms stddev 1.921, 0 failed
progress: 34.0 s, 7.0 tps, lat 134.561 ms stddev 6.965, 0 failed
progress: 35.0 s, 8.0 tps, lat 135.438 ms stddev 5.286, 0 failed
progress: 36.0 s, 7.0 tps, lat 141.760 ms stddev 15.878, 0 failed
progress: 37.0 s, 7.0 tps, lat 136.749 ms stddev 5.937, 0 failed
progress: 38.0 s, 7.0 tps, lat 138.258 ms stddev 8.666, 0 failed
progress: 39.0 s, 7.0 tps, lat 140.263 ms stddev 8.834, 0 failed

2 workers and leader participating, with postmaster CPU affinity (and therefore
any new PQ workers) set to only the CPUs of the node holding s_b, and the
leader CPU-pinned there too:
progress: 342.0 s, 9.0 tps, lat 117.306 ms stddev 0.152, 0 failed
progress: 343.0 s, 8.0 tps, lat 117.265 ms stddev 0.216, 0 failed
progress: 344.0 s, 9.0 tps, lat 117.216 ms stddev 0.221, 0 failed
progress: 345.0 s, 9.0 tps, lat 117.233 ms stddev 0.155, 0 failed

postmaster's CPU affinity (so for the 2 PQ workers) forced to the remote NUMA node:
progress: 346.0 s, 6.0 tps, lat 142.502 ms stddev 12.621, 0 failed
progress: 347.0 s, 6.0 tps, lat 168.971 ms stddev 20.585, 0 failed
progress: 348.0 s, 7.0 tps, lat 157.017 ms stddev 9.333, 0 failed
progress: 349.0 s, 6.0 tps, lat 157.810 ms stddev 15.124, 0 failed
progress: 350.0 s, 7.0 tps, lat 150.623 ms stddev 1.246, 0 failed
progress: 351.0 s, 6.0 tps, lat 156.901 ms stddev 15.012, 0 failed

2 PQ workers on the NUMA node local to s_b + the leader process forced to the remote NUMA node:
progress: 554.0 s, 7.0 tps, lat 130.552 ms stddev 0.256, 0 failed
progress: 555.0 s, 8.0 tps, lat 130.488 ms stddev 0.144, 0 failed
progress: 556.0 s, 8.0 tps, lat 130.795 ms stddev 0.171, 0 failed
progress: 557.0 s, 7.0 tps, lat 130.759 ms stddev 0.250, 0 failed
progress: 558.0 s, 8.0 tps, lat 130.601 ms stddev 0.217, 0 failed
progress: 559.0 s, 8.0 tps, lat 130.606 ms stddev 0.151, 0 failed
progress: 560.0 s, 7.0 tps, lat 130.503 ms stddev 0.217, 0 failed

2 PQ workers on the NUMA node remote to s_b + the leader process on the local one:
progress: 627.0 s, 6.0 tps, lat 150.555 ms stddev 0.147, 0 failed
progress: 628.0 s, 7.0 tps, lat 150.675 ms stddev 0.484, 0 failed
progress: 629.0 s, 7.0 tps, lat 150.617 ms stddev 0.324, 0 failed
progress: 630.0 s, 6.0 tps, lat 150.410 ms stddev 0.087, 0 failed
progress: 631.0 s, 7.0 tps, lat 150.493 ms stddev 0.150, 0 failed
progress: 632.0 s, 7.0 tps, lat 150.566 ms stddev 0.095, 0 failed

With PQ (2 workers and leader participating) and numa=on (interleaved s_b),
default CPU affinity, stddev fluctuates between 2 and 23 ms:
progress: 39.0 s, 6.0 tps, lat 175.806 ms stddev 10.968, 0 failed
progress: 40.0 s, 6.0 tps, lat 172.568 ms stddev 10.002, 0 failed
progress: 41.0 s, 5.0 tps, lat 175.535 ms stddev 9.343, 0 failed
progress: 42.0 s, 6.0 tps, lat 170.761 ms stddev 6.165, 0 failed
progress: 43.0 s, 6.0 tps, lat 164.163 ms stddev 1.575, 0 failed
progress: 44.0 s, 6.0 tps, lat 175.584 ms stddev 13.247, 0 failed
progress: 45.0 s, 6.0 tps, lat 162.442 ms stddev 0.175, 0 failed
but here we have 4x (the number of sockets) more CPU power available at the same latency/bandwidth.

So, if anything:
- latency-wise: it would be best to place the leader + all PQ workers close to
  s_b, provided s_b fits in the shared/huge page memory of that NUMA node and
  you won't need more CPU than that node offers... (e.g. when hosting 4 DBs on
  a 4-socket box, each on its own node, it would be best to pin everything
  there, including shm, but the PQ workers too)
- capacity/TPS-wise, or when s_b is bigger than one NUMA node: just interleave
  to maximize bandwidth and get uniform CPU performance out of it, as in the
  sketch below
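To make the second point concrete, the interleaving itself boils down to
requesting an MPOL_INTERLEAVE policy on the shared mapping before its pages
are first touched. Below is a standalone libnuma sketch of that idea only,
not the patch code; the 1GB size is just a stand-in for s_b or a PQ segment:

/*
 * interleave_sketch.c: allocate an anonymous shared mapping and ask the
 * kernel to interleave its pages round-robin across all NUMA nodes as they
 * are faulted in.  Build with: cc interleave_sketch.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int
main(void)
{
    size_t      sz = 1UL << 30;     /* 1GB, stand-in for shared_buffers */
    void       *mem;

    if (numa_available() < 0)
    {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    mem = mmap(NULL, sz, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /*
     * Set the interleave policy over the whole range before first touch, so
     * pages get spread across all nodes instead of landing on whichever node
     * the touching process happens to run on.
     */
    numa_interleave_memory(mem, sz, numa_all_nodes_ptr);

    memset(mem, 0, sz);             /* fault the pages in */
    printf("interleaved %zu bytes across all NUMA nodes\n", sz);
    return 0;
}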