On Tue, May 21, 2024 at 9:11 AM Melanie Plageman <melanieplage...@gmail.com> wrote:
> So, if you are seeing the slow-down mostly go away by reducing
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums other use
> (dealing with short reads)?
That must be true: the blocknums array is normally only "filled" in the "fast path", where all buffers are found in cache.

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes. It just jiggles the odds around a bit, introducing a little extra unfairness by calling the callback in a tighter loop to build a small batch, and that reveals a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides it would like all workers to stop inserting so it can increase the number of buckets, it sets a flag to ask them to do that and waits for them to see it; but a worker that is filtering all tuples out never checks the "growth" flag. So it scans all the way to the end while everyone else waits. Normally a worker checks that flag when it is time to allocate a new chunk of memory, which seemed to make sense to me at the time: if we've hit the needs-more-buckets (or needs-more-batches) logic, then surely workers are inserting tuples and will soon allocate a new chunk! But, of course, here is the edge case where that isn't true: our estimates were bad, so the hash table was too small (problem #1); we got lots of matching tuples clustered over a few heap pages and decided to expand the hash table, but right at that moment the matching tuples ran out, so somebody had to finish the whole scan without ever checking the flag (problem #2); and that someone happened to hold all the remaining pages because we made the lookahead a bit less fair (problem #3). A nice confluence of problems. I expect #2 and #3 to be easy to fix, and I didn't look at the estimation problem #1 at all (perhaps a stats puzzle designed by the TPC to trip us up?).