Re: [HACKERS] [sqlsmith] Unpinning error in parallel worker

Thomas Munro Tue, 28 Mar 2017 22:32:38 -0700

On Mon, Mar 27, 2017 at 6:53 PM, Thomas Munro
<[email protected]> wrote:
> On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro
> <[email protected]> wrote:
>> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <[email protected]> 
>> wrote:
>>> Hi,
>>>
>>> today's testing with master as of d253b0f6e3 yielded two clusters that
>>> stopped processing queries.  Symptoms:
>>>
>>> [...]
>>
>> Thanks Andreas.  Investigating.
>
> First, the hanging stems from reentering dsm_backend_shutdown and
> trying to acquire DynamicSharedMemoryControlLock which we already
> acquired further down the stack in dsm_unpin_segment when it raised an
> error.  That's obviously not great, but the real question is how we
> reached this this-cannot-happen error condition.
>
> I reproduced this by inserting a sleep before dsa_attach_in_place,
> inserting a call to dsa_allocate into ExecInitParallelPlan so that the
> executor's DSA area owns at least one segment, and then cancelling a
> parallel query before the sleepy worker has managed to attach.  The
> DSA area is destroyed before the worker attaches, but the worker
> doesn't know this, and goes on to destroy it again after it learns
> that the query has been cancelled.
>
> In an earlier version of DSA, attaching should have failed in this
> scenario because the handle would be invalid.  Based on complaints
> about creating an extra DSM segment all the time even if we don't turn
> out to need it, I implemented "in place" DSA areas where the control
> object is in user-supplied shared memory, in this case in the parallel
> query main DSM segment.  But that created a new hazard: if you try to
> attach to a piece of memory that contains the remains of an
> already-destroyed DSA area, then we don't do anything to detect that.
> Oops.
>
> The attached patch fixes that one way: it detects refcnt == 0 as a
> defunct DSA area and raises an error when you try to attach.
>
> Another approach which I could explore would be to "reset" the DSA
> area instead of destroying it when the last backend detaches.  I'm not
> sure if that would ever be useful as a feature in its own right, but
> it would at least allow a very late worker to attach and then detach
> in an orderly fashion in this query-cancelled case, so you wouldn't
> get a different error message in the worker in this rare case.
>
> Thoughts?


Added to open items.

I considered whether the error message could be improved but it
matches the message for an existing similar case (where you try to
attach to an unknown handle).
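
For illustration, the refcnt == 0 check described above can be modelled
in a few lines of standalone C.  This is only a sketch under stated
assumptions, not the code in the attached patch: the toy_* names and the
segments_freed counter are hypothetical, the model is single-threaded
(real code would examine refcnt while holding the area's lock), and a
NULL return stands in for raising an error.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a DSA control object that lives in
 * caller-supplied ("in place") memory.
 */
typedef struct toy_area_control
{
    int refcnt;         /* attached backends; 0 means already destroyed */
    int segments_freed; /* how many times the area's segments were released */
} toy_area_control;

/* Create an area in place; the creator holds the first reference. */
static toy_area_control *
toy_create_in_place(void *place)
{
    toy_area_control *ctl = (toy_area_control *) place;

    ctl->refcnt = 1;
    ctl->segments_freed = 0;
    return ctl;
}

/*
 * Attach to an area in place.  A reference count of zero means the
 * memory only holds the remains of a destroyed area, so refuse.
 */
static toy_area_control *
toy_attach_in_place(void *place)
{
    toy_area_control *ctl = (toy_area_control *) place;

    if (ctl->refcnt == 0)
        return NULL;    /* defunct area: the caller should report an error */
    ctl->refcnt++;
    return ctl;
}

/* Detach; the last backend to detach releases the area's segments. */
static void
toy_detach(toy_area_control *ctl)
{
    assert(ctl->refcnt > 0);
    if (--ctl->refcnt == 0)
        ctl->segments_freed++;
}
```

In this model a late worker that calls toy_attach_in_place after the
last detach gets NULL back and never reaches toy_detach, so the
segments are released exactly once.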

The alternative approach I mentioned above doesn't seem warranted, as
you can already get various different failure messages depending on
timing.

Based on feedback on another thread about how to make reviewers' and
committers' jobs easier, here is a format-patch version with a short
description as raw material for a commit message, in case that is
helpful.

-- 
Thomas Munro
http://www.enterprisedb.com

detect-late-dsa-attach-v2.patch
Description: Binary data
