On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
<petr.jeli...@2ndquadrant.com> wrote:
> On 03/02/17 19:38, Fujii Masao wrote:
>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
>>> <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fu...@gmail.com> 
>>>> wrote in 
>>>> <CAHGQGwHqQVHmQ7wM=elnnp1_oy-gvssacajxwje4nc2twsq...@mail.gmail.com>
>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
>>>>> <michael.paqu...@gmail.com> wrote:
>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:
>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that
>>>>>>>> some functions called from there can throw exceptions.
>>>>>>> Yes, but all LWLocks should be released by normal error recovery.
>>>>>>> It should not be necessary for this code to clean that up by hand.
>>>>>>> If it were necessary, there would be TRY-CATCH around every single
>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and
>>>>>>> unmaintainable system.  Please don't add a TRY-CATCH unless it's
>>>>>>> *necessary* -- and you haven't explained why this one is.
>>>>> Yes.
>>>> Thank you for the suggestion. I minunderstood that.
>>>>>> Putting hands into the code and at the problem, I can see that
>>>>>> dropping a subscription on a node makes it unresponsive in case of a
>>>>>> stop. And that's just because calls to LWLockRelease are missing as in
>>>>>> the patch attached. A try/catch problem should not be necessary.
>>>>> Thanks for the patch!
>>>>> With the patch, LogicalRepLauncherLock is released at the end of
>>>>> DropSubscription(). But ISTM that the lock should be released just after
>>>>> logicalrep_worker_stop() and there is no need to protect the removal of
>>>>> replication slot with the lock.
>>>> That's true. logicalrep_worker_stop returns after confirmig that
>>>> worker->proc is cleard, so no false relaunch cannot be caused.
>>>> After all, logicalrep_worker_stop is surrounded by
>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton
>>>> and make the lock secrion to be more narrower.
>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
>> removed and the comment for logicalrep_worker_stop() should be updated.
>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
>> while holding LogicalRepLauncherLock. OTOH, with your approach,
>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
>> LogicalRepWorkerLock.
>> Therefore I pushed the simple patch which adds LWLockRelease() just after
>> logicalrep_worker_stop().
>> Another problem that I found while reading the code is that the launcher can
>> start up the worker with the subscription that DROP SUBSCRIPTION just 
>> removed.
>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
>> but the launcher can see it and start new worker until the transaction for
>> DROP has been committed.
> That was the reason why DropSubscription didn't release the lock in the
> first place. It was supposed to be released at the end of the
> transaction though.

OK, I understood why you used the lock in that way. But using LWLock
for that purpose is not valid.

>> To fix this issue, I think that DROP SUBSCRIPTION should take
>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
>> so that the launcher cannot see the entry to be being removed.
> The whole point of having LogicalRepLauncherLock is to avoid having to
> do this, so if we do this we could probably get rid of it.

Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
Attached patch does this.


Fujii Masao

Attachment: bugfix.patch
Description: Binary data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to