On Thu, Apr 17, 2025 at 1:58 AM Thomas Munro <thomas.mu...@gmail.com> wrote:
> I have no answers but I have speculated for years about a very
> specific case (without any idea where to begin due to lack of ... I
> guess all this sort of stuff): in ExecParallelHashJoinNewBatch(),
> workers split up and try to work on different batches on their own to
> minimise contention, and when that's not possible (more workers than
> batches, or finishing their existing work at different times and going
> to help others), they just proceed in round-robin order. A beginner
> thought is: if you're going to help someone working on a hash table,
> it would surely be best to have the CPUs and all the data on the same
> NUMA node. During loading, cache line ping pong would be cheaper, and
> during probing, it *might* be easier to tune explicit memory prefetch
> timing that way as it would look more like a single node system with a
> fixed latency, IDK (I've shared patches for prefetching before that
> showed pretty decent speedups, and the lack of that feature is
> probably a bigger problem than any of this stuff, who knows...).
> Another beginner thought is that the DSA allocator is a source of
> contention during loading: the dumbest problem is that the chunks are
> just too small, but it might also be interesting to look into per-node
> pools. Or something. IDK, just some thoughts...
And BTW there are papers about that (but they mostly just remind me
that I have to reboot the prefetching patch long before that...), for
example:

https://15721.courses.cs.cmu.edu/spring2023/papers/11-hashjoins/lang-imdm2013.pdf
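To make the batch-assignment thought slightly more concrete, here's a
rough standalone sketch, nothing like the real code: choose_batch()
and batch_home_node[] are made up, and the loader would somehow have
to record which node each batch's hash table lives on. The idea is
just that a worker first looks for a batch homed on its own node and
only then falls back to today's round-robin:

#define _GNU_SOURCE
#include <sched.h>              /* sched_getcpu() */
#include <stdio.h>
#include <numa.h>               /* numa_node_of_cpu(); link with -lnuma */

static int
choose_batch(const int *batch_home_node, int nbatch, int start)
{
    int     my_node = (numa_available() < 0) ? -1
                        : numa_node_of_cpu(sched_getcpu());

    /* First pass: prefer a batch whose memory is on our own node. */
    for (int i = 0; i < nbatch; i++)
    {
        int     b = (start + i) % nbatch;

        if (batch_home_node[b] == my_node)
            return b;
    }

    /* Nothing local: plain round-robin, as today. */
    return start % nbatch;
}

int
main(void)
{
    int     home[] = {0, 1, 0, 1};  /* made-up placement of four batches */

    printf("worker would pick batch %d\n", choose_batch(home, 4, 2));
    return 0;
}

The fallback matters: if nothing is local you still want to help
*somebody* rather than sit idle.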
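And the prefetching point, as a toy probe loop rather than anything
resembling the actual patch: issue __builtin_prefetch for the bucket
slot a fixed distance ahead, so the cache miss overlaps with useful
work. DISTANCE is exactly the knob that should be easier to tune when
all the memory sits one node away at a uniform latency:

#include <stddef.h>
#include <stdint.h>

#define DISTANCE 8              /* the tuning knob */

typedef struct Tuple
{
    uint32_t    hash;
    struct Tuple *next;
} Tuple;

static long
probe_all(Tuple **buckets, uint32_t mask, const uint32_t *hashes, size_t n)
{
    long    matches = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* Prefetch the bucket slot we'll read DISTANCE iterations later. */
        if (i + DISTANCE < n)
            __builtin_prefetch(&buckets[hashes[i + DISTANCE] & mask], 0, 1);

        for (Tuple *t = buckets[hashes[i] & mask]; t != NULL; t = t->next)
        {
            if (t->hash == hashes[i])
                matches++;
        }
    }
    return matches;
}

int
main(void)
{
    static Tuple t0 = {42, NULL};
    Tuple  *buckets[2] = {&t0, NULL};
    uint32_t hashes[] = {42, 7, 42};

    return probe_all(buckets, 1, hashes, 3) == 2 ? 0 : 1;
}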
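Finally the per-node pool idea, purely illustrative since real DSA
lives in shared memory and couldn't just call libnuma's allocator;
this only shows the shape: grab big per-node slabs (which also attacks
the chunks-too-small problem) and let each loader bump-allocate tuples
out of its local one:

#define _GNU_SOURCE
#include <sched.h>              /* sched_getcpu() */
#include <stdio.h>
#include <numa.h>               /* numa_alloc_onnode(); link with -lnuma */

#define SLAB_SIZE (16 * 1024 * 1024)    /* big chunks: fewer allocator trips */

int
main(void)
{
    int     node;
    char   *slab;

    if (numa_available() < 0)
        return 1;

    node = numa_node_of_cpu(sched_getcpu());
    slab = numa_alloc_onnode(SLAB_SIZE, node);
    if (slab == NULL)
        return 1;

    /* A loader would now bump-allocate its tuples out of this local slab. */
    printf("got a %d MB slab on node %d\n", SLAB_SIZE >> 20, node);

    numa_free(slab, SLAB_SIZE);
    return 0;
}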