On Fri, Oct 27, 2017 at 12:24 AM, Rushabh Lathia
<rushabh.lat...@gmail.com> wrote:
> While re-basing the parallel-B-tree-index-build patch on top of the v22
> patch set, I found some cosmetic review points:


> 1) BufFileSetEstimate is removed but it's still in buffile.h
> +extern size_t BufFileSetEstimate(int stripes);


> 2) BufFileSetCreate was renamed to BufFileSetInit, but the old name is
> still used in this comment:
> * Attach to a set of named BufFiles that was created with BufFileSetCreate.


Other minor tweaks: I fixed a harmless warning from Visual C++ and
cleaned up ExecParallelHashJoinPartitionOuter()'s loop.

Two other changes:

1.  Improved concurrency for sharedtuplestore.c.

Last week I investigated some test failures on AppVeyor CI and
discovered a small problem with sharedtuplestore.c on Windows: it
could occasionally get ERROR_ACCESS_DENIED errors when attempting to
open files that were concurrently being unlinked (unlinking is not
atomic on Windows, see pgsql-bugs #14243 for another manifestation).
That code was a bit sloppy (though not incorrect), and was easy to fix
by doing some checks in a different order, but...

While hacking on that I realised that sharedtuplestore.c's parallel
scan efficiency was pretty terrible anyway, so I made an improvement
that I'd earlier threatened to make in a comment.  Instead of holding
a per-file lock while reading individual tuples, it now works in
page-multiple-sized chunks.  Readers only acquire a spinlock when they
need a new chunk, don't hold any locks while doing IO, and never read
overlapping pages.  From a user perspective, this means that EXPLAIN
(BUFFERS) temporary buffer read counts are approximately the same as
for the equivalent non-parallel hash join, because each worker reads a
disjoint set of temporary file pages instead of reading interleaved
tuples from the same pages, and there is no more LWLock "shared
tuplestore" that can show up in wait_event when backends pile up on
the same batch.  It writes very slightly more than it reads because of
unread pages at the end of the final chunk: it reads back
tuple-at-a-time and thus page-at-a-time, not a whole chunk at a time.
(I considered reading whole chunks and then returning pointers to
MinimalTuples within the chunk, but that required MAXALIGNing data in
the files on disk, which made the files noticeably bigger.)

So, to summarise, there are now three layers of contention-avoidance
strategy used by Parallel Hash Join for scanning batches in parallel:

i)  Participants in a Parallel Hash Join try to work on different
batches, so that ideally they avoid scanning the same SharedTuplestore
at all.  That's visible in EXPLAIN ANALYZE as "Batches Probed" (which
shows the number of outer batches scanned; it doesn't seem worth the
pixels to show "Batches Loaded" for the number of inner batches
scanned, which may be lower).

ii)  When that's not possible, they start out reading from different
backing files, beginning with the one they wrote themselves and then
going around the full set.  That means they don't contend on the
per-file read-head lock until a reader drains a whole file and then
chooses another one that's still being read by someone else.

iii)  [New in this version] Since they might still finish up reading
from the same file (and often do towards the end of a join), the
tuples are chopped into multi-page chunks and participants are
allocated chunks using a spinlock-protected counter.  This is quite a
lot like Parallel Sequential Scan, with some extra complications due
to variable-sized chunks.

2.  Improved oversized tuple handling.

I added a new regression test case to exercise the oversized tuple
support in ExecParallelHashLoadTuple() and sts_puttuple(), as
requested by Andres a while back.  (Thanks to Andrew Gierth for a
pointer on how to get detoasted tuples into a hash join table without
disabling parallelism.)  While testing that I realised that my
defences against some degenerate behaviour with very small work_mem
weren't quite good enough.  Previously, I adjusted space_allowed to
make sure every backend could allocate at least one memory chunk
without triggering an increase in the number of batches.  Now, I leave
space_allowed alone but explicitly allow every backend to allocate at
least one chunk ignoring the memory budget (whether regular chunk size
or oversized tuple size), to avoid futile repeated batch increases
when a single monster tuple is never going to fit in work_mem.

A couple of stupid things outstanding:

1.  EXPLAIN ANALYZE for Parallel Hash "actual" shows the complete row
count, which is interesting to know (or not?  maybe I should show it
somewhere else?), but the costing shows the partial row estimate used
for costing purposes.
2.  The BufFileSet's temporary directory gets created even if you
don't need it for batches.  Duh.
3.  I don't have a good rescan regression query yet.  I wish I could
write my own query plans to test the executor.

Thomas Munro

Attachment: parallel-hash-v23.patchset.tgz
Description: GNU Zip compressed data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)