On Mon, Aug 15, 2016 at 6:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> The simple version of this is that when a worker gets done with its
> own probe phase for batch X, it can immediately start building the
> hash table for phase X+1, stopping if it fills up the unused portion
> of work_mem before the old hash table goes away.  Of course, there are
> some tricky issues with reading tapes that were originally created by
> other backends, but if I understand correctly, Peter Geoghegan has
> already done some work on that problem, and it seems like something we
> can eventually solve, even if not in the first version.

The tape vs. BufFile vs. fd.c file handle distinctions get
*confusing*. Thomas and I have hashed this out (pun intended), but I
should summarize.

Currently, and without bringing parallelism into it, hash joins have
multiple BufFiles (two per batch -- innerBatchFile and
outerBatchFile), which are accessed as needed. External sorts have
only one BufFile, with multiple "logical tapes" within a single
"tapeset" effectively owning space within the BufFile -- that space
doesn't have to be contiguous, and can be reused *eagerly* within and
across logical tapes in tuplesort.c's tapeset. logtape.c is a kind of
block-oriented rudimentary filesystem built on top of one BufFile.
The only real advantage of having the logtape.c abstraction is that
moving stuff around (to sort it, when multiple passes are required)
can be accomplished with minimal wasted disk space (it's eagerly
reclaimed). This is less important today than it would have been in
the past.
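
To illustrate the eager reuse described above, here is a toy sketch in
C. It is not logtape.c's actual code; the ToyTapeSet type and the
toy_* functions are hypothetical names made up for this example, and
the real module is considerably more elaborate. The only point is that
a block released while reading one tape can be handed straight back out
for the next write, so the shared file grows only when no recycled
block is available.

```c
#define TOY_MAX_FREE 64

/* Toy tapeset: hands out block numbers within one shared file. */
typedef struct ToyTapeSet
{
    long        nblocks_allocated;          /* file size so far, in blocks */
    long        free_blocks[TOY_MAX_FREE];  /* stack of recycled blocks */
    int         nfree;                      /* entries in free_blocks */
} ToyTapeSet;

static void
toy_init(ToyTapeSet *ts)
{
    ts->nblocks_allocated = 0;
    ts->nfree = 0;
}

/* Get a block to write: reuse a freed block if any, else extend the file. */
static long
toy_get_block(ToyTapeSet *ts)
{
    if (ts->nfree > 0)
        return ts->free_blocks[--ts->nfree];
    return ts->nblocks_allocated++;
}

/* Called once a block has been read and will not be needed again. */
static void
toy_release_block(ToyTapeSet *ts, long blocknum)
{
    /* In this toy, a block is simply forgotten if the stack is full. */
    if (ts->nfree < TOY_MAX_FREE)
        ts->free_blocks[ts->nfree++] = blocknum;
}
```

Any tape in the set may pick up a block that another tape released,
which is why the space owned by a tape need not be contiguous.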

It doesn't make much sense to talk about logtape.c in connection with
anything that isn't sorting, because it is very clearly written with
that purpose alone in mind. To avoid confusion, please only talk about
tapes when talking about sorting.


* tuplesort.c always talks to logtape.c, which talks to buffile.c
(which talks to fd.c).

* Hash joins use buffile.c directly, though (and have multiple
buffiles, as already noted).

Now, I might still have something that Thomas can reuse, because
buffile.c was made to support "unification" of worker BufFiles in
general. Thomas would be using that interface, if any. I haven't
studied parallel hash join at all, but presumably the difference would
be that *multiple* BufFiles would be unified, such that a
concatenated/unified BufFile would be addressable within each worker,
one per batch. All of this assumes that there is a natural way of
unifying the various batches involved across all workers, of course.
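
As a rough picture of what "concatenated/unified" could mean for
addressing, the sketch below maps an offset in a unified file onto a
(worker segment, local offset) pair. This is purely illustrative: the
ToyUnifiedFile type and toy_unified_locate() are hypothetical names,
not the buffile.c interface, and it assumes that each worker
contributes one segment of known size per batch.

```c
/* Hypothetical: one batch's data, stored as one segment per worker. */
typedef struct ToyUnifiedFile
{
    int         nsegments;      /* number of worker segments */
    const long *segment_sizes;  /* size in bytes of each segment */
} ToyUnifiedFile;

/*
 * Translate an offset in the concatenated file into a segment number
 * and an offset within that segment.  Returns 0 on success, or -1 if
 * the offset lies outside the unified file.
 */
static int
toy_unified_locate(const ToyUnifiedFile *uf, long offset,
                   int *segment, long *local_offset)
{
    int         i;

    if (offset < 0)
        return -1;

    for (i = 0; i < uf->nsegments; i++)
    {
        if (offset < uf->segment_sizes[i])
        {
            *segment = i;
            *local_offset = offset;
            return 0;
        }
        /* Skip past this worker's segment (empty segments skip freely). */
        offset -= uf->segment_sizes[i];
    }
    return -1;
}
```

With one such mapping per batch, every worker would see the same
logical file for a given batch, regardless of which worker wrote which
part of it.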

This aspect would present some complexity for Thomas, I think
(comments from hashjoin.h):

 * It is possible to increase nbatch on the fly if the in-memory hash table
 * gets too big.  The hash-value-to-batch computation is arranged so that this
 * can only cause a tuple to go into a later batch than previously thought,
 * never into an earlier batch.  When we increase nbatch, we rescan the hash
 * table and dump out any tuples that are now of a later batch to the correct
 * inner batch file.  Subsequently, while reading either inner or outer batch
 * files, we might find tuples that no longer belong to the current batch;
 * if so, we just dump them out to the correct batch file.
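
The "later batch, never earlier" property in that comment can be
sketched as follows. This is a minimal illustration, assuming that
nbatch is always a power of two and that the batch number is taken from
the hash bits just above those used for the bucket number; the toy_*
names are hypothetical and this is not the executor's own code. Under
those assumptions, doubling nbatch exposes one more hash bit, so a
tuple either stays in its batch or moves to batch (old batch + old
nbatch).

```c
#include <stdint.h>

/* Batch number from the hash bits above the bucket bits (assumed scheme). */
static int
toy_batchno(uint32_t hashvalue, int log2_nbuckets, int nbatch)
{
    /* nbatch must be a power of two, so (nbatch - 1) is a bit mask */
    return (int) ((hashvalue >> log2_nbuckets) & (uint32_t) (nbatch - 1));
}

/*
 * Returns 1 if doubling nbatch leaves the tuple in its batch or moves
 * it to a later one, which is the only outcome this scheme allows.
 */
static int
toy_doubling_is_monotonic(uint32_t hashvalue, int log2_nbuckets, int nbatch)
{
    int         old_batch = toy_batchno(hashvalue, log2_nbuckets, nbatch);
    int         new_batch = toy_batchno(hashvalue, log2_nbuckets, nbatch * 2);

    return new_batch == old_batch || new_batch == old_batch + nbatch;
}
```

In a parallel setting, every participant would have to agree on nbatch
at the moment it classifies a tuple, which is one reason the question
of who may move tuples between batches matters.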

I'd be concerned about managing which backend was entitled to move
tuples across batches, and so on. One thing that I haven't had to
contend with is which backend "owns" which BufFile (or underlying fd.c
file handles). There is no ambiguity about that for me. Owners delete
the temp files at transaction end, and are the only ones entitled to
write to the files, and then only before unification. These latter
restrictions might be lifted if there were a good reason to do so.

Peter Geoghegan
