On Tue, May 21, 2024 at 9:11 AM Melanie Plageman <melanieplage...@gmail.com> wrote:
> So, if you are seeing the slow-down mostly go away by reducing
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums other use
> (dealing with short reads)?
That must be true: the blocknums array is normally only "filled" in the "fast path", where all buffers are found in cache.

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes. It just jiggles the odds around a bit, introducing a little extra unfairness by calling the callback in a tighter loop to build a small batch, and that reveals a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides it would like all workers to stop inserting so it can increase the number of buckets, it sets a flag to ask them to do that and waits for them to see it; but a worker that is filtering all tuples out never checks the "growth" flag. So it scans all the way to the end while everyone else waits. Normally a worker checks that flag when it is time to allocate a new chunk of memory, which seemed to make sense to me at the time: if we've hit the needs-more-buckets (or needs-more-batches) logic, then surely workers are inserting tuples and will soon allocate a new chunk! But, of course, here is the edge case where that isn't true: our estimates were bad, so the hash table was too small (problem #1); we got lots of matching tuples clustered over a few heap pages and decided to expand the hash table, but right at that moment the matching tuples ran out, so somebody had to finish the whole scan without ever checking the flag (problem #2); and that someone happened to hold all the remaining pages because we made the lookahead a bit less fair (problem #3). A nice confluence of problems. I expect #2 and #3 to be easy to fix, and I didn't look at the estimation problem #1 at all (perhaps a stats puzzle designed by the TPC to trip us up?).