Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Thomas Munro Thu, 09 Feb 2017 14:10:12 -0800

On Fri, Feb 10, 2017 at 9:51 AM, Robert Haas <[email protected]> wrote:
> On Wed, Feb 8, 2017 at 5:36 AM, Thomas Munro
> <[email protected]> wrote:
>> Thinking a bit harder about this, I suppose there could be a kind of
>> object called a SharedBufFileManager [... description of that ...].
>
> I think this is approximately reasonable, but I think it could be made
> simpler by having fewer separate objects.  Let's assume the leader can
> put an upper bound on the number of shared BufFiles at the time it's
> sizing the DSM segment (i.e. before InitializeParallelDSM).  Then it
> can allocate a big ol' array with a header indicating the array size
> and each element containing enough space to identify the relevant
> details of 1 shared BufFile.  Now you don't need to do any allocations
> later on, and you don't need a linked list.  You just loop over the
> array and do what needs doing.


Makes sense.

>> There are a couple of problems with the above though.  Firstly, doing
>> reference counting in DSM segment on-detach hooks is really a way to
>> figure out when the DSM segment is about to be destroyed by keeping a
>> separate refcount in sync with the DSM segment's refcount, but it
>> doesn't account for pinned DSM segments.  It's not your use-case or
>> mine currently, but someone might want a DSM segment to live even when
>> it's not attached anywhere, to be reattached later.  If we're trying
>> to use DSM segment lifetime as a scope, we'd be ignoring this detail.
>> Perhaps instead of adding our own refcount we need a new kind of hook
>> on_dsm_destroy.
>
> I think it's good enough to plan for current needs now.  It's not
> impossible to change this stuff later, but we need something that
> works robustly right now without being too invasive.  Inventing whole
> new system concepts because of stuff we might someday want to do isn't
> a good idea because we may easily guess wrong about what direction
> we'll want to go in the future.  This is more like building a wrench
> than a 747: a 747 needs to be extensible and reconfigurable and
> upgradable because it costs $350 million.   A wrench costs $10 at
> Walmart and if it turns out we bought the wrong one, we can just throw
> it out and get a different one later.

I agree that the pinned segment case doesn't matter right now, I just
wanted to point it out.  I like your $10 wrench analogy, but maybe it
could be argued that adding a dsm_on_destroy() callback mechanism is
not only better than adding another refcount to track that other
refcount, but also a steal at only $8.

>> Secondly, I might not want to be constrained by a
>> fixed-sized DSM segment to hold my SharedBufFile objects... there are
>> cases where I need to shared a number of batch files that is unknown
>> at the start of execution time when the DSM segment is sized (I'll
>> write about that shortly on the Parallel Shared Hash thread).  Maybe I
>> can find a way to get rid of that requirement.  Or maybe it could
>> support DSA memory too, but I don't think it's possible to use
>> on_dsm_detach-based cleanup routines that refer to DSA memory because
>> by the time any given DSM segment's detach hook runs, there's no
>> telling which other DSM segments have been detached already, so the
>> DSA area may already have partially vanished; some other kind of hook
>> that runs earlier would be needed...
>
> Again, wrench.

My problem here is that I don't know how many batches I'll finish up
creating.  In general that's OK because I can hold onto them as
private BufFiles owned by participants with the existing cleanup
mechanism, and then share them just before they need to be shared (ie
when we switch to processing the next batch so they need to be
readable by all).  Now I only ever share one inner and one outer batch
file per participant at a time, and then I explicitly delete them at a
time that I know to be safe and before I need to share a new file that
would involve recycling the slot, and I'm relying on DSM segment scope
cleanup only to handle error paths.  That means that in generally I
only need space for 2 * P shared BufFiles at a time.  But there is a
problem case: when the leader needs to exit early, it needs to be able
to transfer ownership of any files it has created, which could be more
than we planned for, and then not participate any further in the hash
join, so it can't participate in the on-demand sharing scheme.

Perhaps we can find a way to describe a variable number of BufFiles
(ie batches) in a fixed space by making sure the filenames are
constructed in a way that lets us just have to say how many there are.
Then the next problem is that for each BufFile we have to know how
many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
...), which Peter's code currently captures by publishing the file
size in the descriptor... but if a fixed size object must describe N
BufFiles, where can I put the size of each one?  Maybe I could put it
in a header of the file itself (yuck!), or maybe I could decide that I
don't care what the size is, I'll simply unlink "foo", then "foo.1",
then "foo.2", ... until I get ENOENT.

Alternatively I might get rid of the requirement for the leader to
drop out of processing later batches.  I'm about to post a message to
the other thread about how to do that, but it's complicated and I'm
currently working on the assumption that the PSH patch is useful
without it (but let's not discuss that in this thread).  That would
have the side effect of getting rid of the requirement to share a
number of BufFiles that isn't known up front.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Reply via email to