Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Thomas Munro Thu, 09 Feb 2017 15:39:25 -0800

On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas <[email protected]> wrote:
> On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro
> <[email protected]> wrote:
>> I agree that the pinned segment case doesn't matter right now, I just
>> wanted to point it out.  I like your $10 wrench analogy, but maybe it
>> could be argued that adding a dsm_on_destroy() callback mechanism is
>> not only better than adding another refcount to track that other
>> refcount, but also a steal at only $8.
>
> If it's that simple, it might be worth doing, but I bet it's not.  One
> problem is that there's a race condition: there will inevitably be a
> period of time after you've called dsm_attach() and before you've
> attached to the specific data structure that we're talking about here.
> So suppose the last guy who actually knows about this data structure
> dies horribly and doesn't clean up because the DSM isn't being
> destroyed; moments later, you die horribly before reaching the code
> where you attach to this data structure.  Oops.


Right, I mentioned this problem earlier ("and also register the
on_dsm_detach callback, before any chance that an error might be
thrown (is that difficult?); failure to do so could result in file
leaks").

Here's my thought process... please tell me where I'm going wrong:

I have been assuming that it's not enough to just deal with this when
the leader detaches on the theory that other participants will always
detach first: that probably isn't true in some error cases, and could
contribute to spurious racy errors where other workers complain about
disappearing files if the leader somehow shuts down and cleans up
while a worker is still running.  Therefore we need *some* kind of
refcounting, whether it's a new kind or a new mechanism based on the
existing kind.

I have also been assuming that we don't want to teach dsm.c directly
about this stuff; it shouldn't need to know about other modules, so we
don't want it talking to buffile.c directly and managing a special
table of files; instead we want a system of callbacks.  Therefore
client code needs to do something after attaching to the segment in
each backend.

It doesn't matter whether we use an on_dsm_detach() callback and
manage our own refcount to infer that destruction is imminent, or a
new on_dsm_destroy() callback which tells us so explicitly: both ways
we'll need to make sure that anyone who attaches to the segment also
"attaches" to this shared BufFile manager object inside it, because
any backend might turn out to be the one that is last to detach.

That bring us to the race you mentioned.  Isn't it sufficient to say
that you aren't allowed to do anything that might throw in between
attaching to the segment and attaching to the SharedBufFileManager
that it contains?

Up until two minutes ago I assumed that policy would leave only two
possibilities: you attach to the DSM segment and attach to the
SharedBufFileManager successfully or you attach to the DSM segment and
then die horribly (but not throw) and the postmaster restarts the
whole cluster and blows all temp files away with RemovePgTempFiles().
But I see now in the comment of that function that crash-induced
restarts don't call that because "someone might want to examine the
temp files for debugging purposes".  Given that policy for regular
private BufFiles, I don't see why that shouldn't apply equally to
shared files: after a crash restart, you may have some junk files that
won't be cleaned up until your next clean restart, whether they were
private or shared BufFiles.

> You might think about plugging that hole by moving the registry of
> on-destroy functions into the segment itself and making it a shared
> resource.  But ASLR breaks that, especially for loadable modules.  You
> could try to fix that problem, in turn, by storing arguments that can
> later be passed to load_external_function() instead of a function
> pointer per se.  But that sounds pretty fragile because some other
> backend might not try to load the module until after it's attached the
> DSM segment and it might then fail because loading the module runs
> _PG_init() which can throw errors.   Maybe you can think of a way to
> plug that hole too but you're waaaaay over your $8 budget by this
> point.

Agreed, those approaches seem like a non-starters.

>> My problem here is that I don't know how many batches I'll finish up
>> creating.  [...]
>
> I thought the idea was that the structure we're talking about here
> owns all the files, up to 2 from a leader that wandered off plus up to
> 2 for each worker.  Last process standing removes them.  Or are you
> saying each worker only needs 2 files but the leader needs a
> potentially unbounded number?

Yes, potentially unbounded in rare case.  If we plan for N batches,
and then run out of work_mem because our estimates were just wrong or
the distributions of keys is sufficiently skewed, we'll run
HashIncreaseNumBatches, and that could happen more than once.  I have
a suite of contrived test queries that hits all the various modes and
code paths of hash join, and it includes a query that plans for one
batch but finishes up creating many, and then the leader exits.  I'll
post that to the other thread along with my latest patch series soon.

>> Perhaps we can find a way to describe a variable number of BufFiles
>> (ie batches) in a fixed space by making sure the filenames are
>> constructed in a way that lets us just have to say how many there are.
>
> That could be done.

Cool.

>> Then the next problem is that for each BufFile we have to know how
>> many 1GB segments there are to unlink (files named foo, foo.1, foo.2,
>> ...), which Peter's code currently captures by publishing the file
>> size in the descriptor... but if a fixed size object must describe N
>> BufFiles, where can I put the size of each one?  Maybe I could put it
>> in a header of the file itself (yuck!), or maybe I could decide that I
>> don't care what the size is, I'll simply unlink "foo", then "foo.1",
>> then "foo.2", ... until I get ENOENT.
>
> There's nothing wrong with that algorithm as far as I'm concerned.

Cool.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Reply via email to