On Wed, Dec 21, 2016 at 7:04 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> If the worker is always completely finished with the tape before the
>> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
>> the tape and overwrite it like any other?
>
> Currently, the logical tape code assumes that all tapes in a single
> LogicalTapeSet are allocated from the same BufFile. The logical tape's
> on-disk format contains block numbers, to point to the next/prev block
> of the tape, and they're assumed to refer to the same file. That allows
> reusing space efficiently during the merge. After you have read the
> first block from tapes A, B and C, you can immediately reuse those
> three blocks for output tape D.
I see. Hmm.

> Now, if you read multiple tapes, from different LogicalTapeSets, hence
> backed by different BufFiles, you cannot reuse the space from those
> different tapes for a single output tape, because the on-disk format
> doesn't allow referring to blocks in other files. You could reuse the
> space of *one* of the input tapes, by placing the output tape in the
> same LogicalTapeSet, but not all of them.
>
> We could enhance that, by using "filename + block number" instead of
> just block number, in the pointers in the logical tapes. Then you could
> spread one logical tape across multiple files. Probably not worth it in
> practice, though.

OK, so the options as I understand them are:

1. Enhance the logical tape set infrastructure in the manner you
mention, to support filename (or more likely a proxy for filename) +
block number in the logical tape pointers. Then, tapes can be
transferred from one LogicalTapeSet to another.

2. Enhance the BufFile infrastructure to support some notion of a shared
BufFile, so that multiple processes can be reading and writing blocks in
the same BufFile. Then, extend the logical tape infrastructure so that
we also have the notion of a shared LogicalTape. This means that things
like ltsGetFreeBlock() need to be re-engineered to handle concurrency
with other backends.

3. Just live with the waste of space.

I would guess that (1) is easier than (2). Also, (2) might provoke
contention while writing tapes that is otherwise completely unnecessary.
It seems silly to have multiple backends fighting over the same
end-of-file pointer for the same file when they could just write to
different files instead.

Another tangentially-related problem I just realized is that we need to
somehow handle the issues that tqueue.c does when transferring tuples
between backends -- most of the time there's no problem, but if
anonymous record types are involved then tuples require "remapping".
It's probably harder to provoke a failure in the tuplesort case than
with parallel query per se, but it's probably not impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers