On Wed, Feb 8, 2017 at 12:04 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
> On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
> <petr.jeli...@2ndquadrant.com> wrote:
>> On 06/02/17 17:33, Fujii Masao wrote:
>>> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
>>> <petr.jeli...@2ndquadrant.com> wrote:
>>>> On 03/02/17 19:38, Fujii Masao wrote:
>>>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fu...@gmail.com> 
>>>>> wrote:
>>>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
>>>>>> <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>>>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fu...@gmail.com> 
>>>>>>> wrote in 
>>>>>>> <CAHGQGwHqQVHmQ7wM=elnnp1_oy-gvssacajxwje4nc2twsq...@mail.gmail.com>
>>>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
>>>>>>>> <michael.paqu...@gmail.com> wrote:
>>>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:
>>>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that
>>>>>>>>>>> some functions called from there can throw exceptions.
>>>>>>>>>> Yes, but all LWLocks should be released by normal error recovery.
>>>>>>>>>> It should not be necessary for this code to clean that up by hand.
>>>>>>>>>> If it were necessary, there would be TRY-CATCH around every single
>>>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and
>>>>>>>>>> unmaintainable system.  Please don't add a TRY-CATCH unless it's
>>>>>>>>>> *necessary* -- and you haven't explained why this one is.
>>>>>>>> Yes.
>>>>>>> Thank you for the suggestion. I minunderstood that.
>>>>>>>>> Putting hands into the code and at the problem, I can see that
>>>>>>>>> dropping a subscription on a node makes it unresponsive in case of a
>>>>>>>>> stop. And that's just because calls to LWLockRelease are missing as in
>>>>>>>>> the patch attached. A try/catch problem should not be necessary.
>>>>>>>> Thanks for the patch!
>>>>>>>> With the patch, LogicalRepLauncherLock is released at the end of
>>>>>>>> DropSubscription(). But ISTM that the lock should be released just 
>>>>>>>> after
>>>>>>>> logicalrep_worker_stop() and there is no need to protect the removal of
>>>>>>>> replication slot with the lock.
>>>>>>> That's true. logicalrep_worker_stop returns after confirmig that
>>>>>>> worker->proc is cleard, so no false relaunch cannot be caused.
>>>>>>> After all, logicalrep_worker_stop is surrounded by
>>>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton
>>>>>>> and make the lock secrion to be more narrower.
>>>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
>>>>> removed and the comment for logicalrep_worker_stop() should be updated.
>>>>> Your approach may cause the deadlock. The launcher takes 
>>>>> LogicalRepWorkerLock
>>>>> while holding LogicalRepLauncherLock. OTOH, with your approach,
>>>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
>>>>> LogicalRepWorkerLock.
>>>>> Therefore I pushed the simple patch which adds LWLockRelease() just after
>>>>> logicalrep_worker_stop().
>>>>> Another problem that I found while reading the code is that the launcher 
>>>>> can
>>>>> start up the worker with the subscription that DROP SUBSCRIPTION just 
>>>>> removed.
>>>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
>>>>> but the launcher can see it and start new worker until the transaction for
>>>>> DROP has been committed.
>>>> That was the reason why DropSubscription didn't release the lock in the
>>>> first place. It was supposed to be released at the end of the
>>>> transaction though.
>>> OK, I understood why you used the lock in that way. But using LWLock
>>> for that purpose is not valid.
>> Yeah, I just tried to avoid what we are doing now really hard :)
>>>>> To fix this issue, I think that DROP SUBSCRIPTION should take
>>>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
>>>>> so that the launcher cannot see the entry to be being removed.
>>>> The whole point of having LogicalRepLauncherLock is to avoid having to
>>>> do this, so if we do this we could probably get rid of it.
>>> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
>>> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
>>> Attached patch does this.
>> Okay, looks reasonable to me.
> Thanks for the review!
> But ISMT that I should suspend committing the patch until we fix the issue
> that Sawada reported in other thread. That bugfix may change the related
> code and design very much.
> https://www.postgresql.org/message-id/cad21aod+vo93zz4zqtzqb-jz_wmko3oggdx1mxo4t+8q_zh...@mail.gmail.com

That patch has been committed. And this issue still happens. Should we
add this to the open item list so it doesn't get missed?


Masahiko Sawada
NTT Open Source Software Center

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to