Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Robert Haas Fri, 10 Feb 2017 04:54:07 -0800

On Thu, Feb 9, 2017 at 6:38 PM, Thomas Munro
<[email protected]> wrote:
> Here's my thought process... please tell me where I'm going wrong:
>
> I have been assuming that it's not enough to just deal with this when
> the leader detaches on the theory that other participants will always
> detach first: that probably isn't true in some error cases, and could
> contribute to spurious racy errors where other workers complain about
> disappearing files if the leader somehow shuts down and cleans up
> while a worker is still running.  Therefore we need *some* kind of
> refcounting, whether it's a new kind or a new mechanism based on the
> existing kind.


+1.

> I have also been assuming that we don't want to teach dsm.c directly
> about this stuff; it shouldn't need to know about other modules, so we
> don't want it talking to buffile.c directly and managing a special
> table of files; instead we want a system of callbacks.  Therefore
> client code needs to do something after attaching to the segment in
> each backend.

+1.

> It doesn't matter whether we use an on_dsm_detach() callback and
> manage our own refcount to infer that destruction is imminent, or a
> new on_dsm_destroy() callback which tells us so explicitly: both ways
> we'll need to make sure that anyone who attaches to the segment also
> "attaches" to this shared BufFile manager object inside it, because
> any backend might turn out to be the one that is last to detach.

Not entirely.  In the first case, you don't need the requirement that
everyone who attaches the segment must attach to the shared BufFile
manager.  In the second case, you do.

> That bring us to the race you mentioned.  Isn't it sufficient to say
> that you aren't allowed to do anything that might throw in between
> attaching to the segment and attaching to the SharedBufFileManager
> that it contains?

That would be sufficient, but I think it's not a very good design.  It
means, for example, that nothing between the time you attach to the
segment and the time you attach to this manager can palloc() anything.
So, for example, it would have to happen before ParallelWorkerMain
reaches the call to shm_mq_attach, which kinda sucks because we want
to do that as soon as possible after attaching to the DSM segment so
that errors are reported properly thereafter.  Note that's the very
first thing we do now, except for working out what the arguments to
that call need to be.

Also, while it's currently safe to assume that shm_toc_attach() and
shm_toc_lookup() don't throw errors, I've thought about the
possibility of installing some sort of cache in shm_toc_lookup() to
amortize the cost of lookups, if the number of keys ever got too
large.  And that would then require a palloc().  Generally, backend
code should be free to throw errors.  When it's absolutely necessary
for a short segment of code to avoid that, then we do, but you can't
really rely on any substantial amount of code to be that way, or stay
that way.

And in this case, even if we didn't mind those problems or had some
solution to them, I think that the shared buffer manager shouldn't
have to be something that is whacked directly into parallel.c all the
way at the beginning of the initialization sequence so that nothing
can fail before it happens.  I think it should be an optional data
structure that clients of the parallel infrastructure can decide to
use, or to not use.  It should be at arm's length from the core code,
just like the way ParallelQueryMain() is distinct from
ParallelWorkerMain() and sets up its own set of data structures with
their own set of keys.  All that stuff is happy to happen after
whatever ParallelWorkerMain() feels that it needs to do, even if
ParallelWorkerMain might throw errors for any number of unknown
reasons.  Similarly, I think this new things should be something than
an executor node can decide to create inside its own per-node space --
reserved via ExecParallelEstimate, initialized
ExecParallelInitializeDSM, etc.  There's no need for it to be deeply
coupled to parallel.c itself unless we force that choice by sticking a
no-fail requirement in there.

> Up until two minutes ago I assumed that policy would leave only two
> possibilities: you attach to the DSM segment and attach to the
> SharedBufFileManager successfully or you attach to the DSM segment and
> then die horribly (but not throw) and the postmaster restarts the
> whole cluster and blows all temp files away with RemovePgTempFiles().
> But I see now in the comment of that function that crash-induced
> restarts don't call that because "someone might want to examine the
> temp files for debugging purposes".  Given that policy for regular
> private BufFiles, I don't see why that shouldn't apply equally to
> shared files: after a crash restart, you may have some junk files that
> won't be cleaned up until your next clean restart, whether they were
> private or shared BufFiles.

I think most people (other than Tom) would agree that that policy
isn't really sensible any more; it probably made sense when the
PostgreSQL user community was much smaller and consisted mostly of the
people developing PostgreSQL, but these days it's much more likely to
cause operational headaches than to help a developer debug.
Regardless, I think the primary danger isn't failure to remove a file
(although that is best avoided) but removing one too soon (causing
someone else to error when opening it, or on Windows causing the
delete itself to error out).  It's not really OK for random stuff to
throw errors in corner cases because we were too lazy to ensure that
cleanup operations happen in the right order.

>> I thought the idea was that the structure we're talking about here
>> owns all the files, up to 2 from a leader that wandered off plus up to
>> 2 for each worker.  Last process standing removes them.  Or are you
>> saying each worker only needs 2 files but the leader needs a
>> potentially unbounded number?
>
> Yes, potentially unbounded in rare case.  If we plan for N batches,
> and then run out of work_mem because our estimates were just wrong or
> the distributions of keys is sufficiently skewed, we'll run
> HashIncreaseNumBatches, and that could happen more than once.  I have
> a suite of contrived test queries that hits all the various modes and
> code paths of hash join, and it includes a query that plans for one
> batch but finishes up creating many, and then the leader exits.  I'll
> post that to the other thread along with my latest patch series soon.

Hmm, OK.  So that's going to probably require something where a fixed
amount of DSM can describe an arbitrary number of temp file series.
But that also means this is an even-more-special-purpose tool that
shouldn't be deeply tied into parallel.c so that it can run before any
errors happen.

Basically, I think the "let's write the code between here and here so
it throws no errors" technique is, for 99% of PostgreSQL programming,
difficult and fragile.  We shouldn't rely on it if there is some other
reasonable option.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

Reply via email to