On Mon, Aug 15, 2016 at 6:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> The simple version of this is that when a worker gets done with its
> own probe phase for batch X, it can immediately start building the
> hash table for phase X+1, stopping if it fills up the unused portion
> of work_mem before the old hash table goes away. Of course, there are
> some tricky issues with reading tapes that were originally created by
> other backends, but if I understand correctly, Peter Geoghegan has
> already done some work on that problem, and it seems like something we
> can eventually solve, even if not in the first version.
The tape vs. BufFile vs. fd.c file handle distinctions get *confusing*. Thomas and I have hashed this out (pun intended), but I should summarize.

Currently, and without bringing parallelism into it, hash joins have multiple BufFiles (two per batch -- innerBatchFile and outerBatchFile), which are accessed as needed. External sorts have only one BufFile, with multiple "logical tapes" within a single "tapeset" effectively owning space within that BufFile -- the space doesn't have to be contiguous, and can be reused *eagerly* within and across logical tapes in tuplesort.c's tapeset. logtape.c is a kind of rudimentary block-oriented filesystem built on top of one BufFile. The only real advantage of the logtape.c abstraction is that moving stuff around (to sort it, when multiple passes are required) can be accomplished with minimal wasted disk space, because space is eagerly reclaimed. This is less important today than it would have been in the past. Clearly, it doesn't make much sense to talk about logtape.c in connection with anything other than sorting, because it is very clearly written with that purpose alone in mind. To avoid confusion, please only talk about tapes when talking about sorting.

So:

* tuplesort.c always talks to logtape.c, which talks to buffile.c (which
talks to fd.c).

* Hash joins use buffile.c directly, though (and have multiple BufFiles,
as already noted).

Now, I might still have something that Thomas can reuse, because buffile.c was made to support "unification" of worker BufFiles in general. Thomas would be using that interface, if any. I haven't studied parallel hash join at all, but presumably the difference would be that *multiple* BufFiles would be unified, such that a concatenated/unified BufFile would be addressable within each worker, one per batch. All of this assumes that there is a natural way of unifying the various batches involved across all workers, of course.
This aspect would present some complexity for Thomas, I think (comments from hashjoin.h):

 * It is possible to increase nbatch on the fly if the in-memory hash table
 * gets too big. The hash-value-to-batch computation is arranged so that this
 * can only cause a tuple to go into a later batch than previously thought,
 * never into an earlier batch. When we increase nbatch, we rescan the hash
 * table and dump out any tuples that are now of a later batch to the correct
 * inner batch file. Subsequently, while reading either inner or outer batch
 * files, we might find tuples that no longer belong to the current batch;
 * if so, we just dump them out to the correct batch file.

I'd be concerned about managing which backend was entitled to move tuples across batches, and so on.

One thing that I haven't had to contend with is which backend "owns" which BufFile (or underlying fd.c file handles). There is no ambiguity about that for me: owners delete the temp files on Xact end, and are the only ones entitled to write to their files, and only before unification. These latter restrictions might be lifted if there was a good reason to do so.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers