On Wed, Mar 8, 2017 at 9:05 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 12:04 AM, Fujii Masao <masao.fu...@gmail.com> wrote:
>> On Tue, Feb 7, 2017 at 2:13 AM, Petr Jelinek
>> <petr.jeli...@2ndquadrant.com> wrote:
>>> On 06/02/17 17:33, Fujii Masao wrote:
>>>> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
>>>> <petr.jeli...@2ndquadrant.com> wrote:
>>>>> On 03/02/17 19:38, Fujii Masao wrote:
>>>>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao.fu...@gmail.com> 
>>>>>> wrote:
>>>>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
>>>>>>> <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>>>>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao.fu...@gmail.com> 
>>>>>>>> wrote in 
>>>>>>>> <CAHGQGwHqQVHmQ7wM=elnnp1_oy-gvssacajxwje4nc2twsq...@mail.gmail.com>
>>>>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
>>>>>>>>> <michael.paqu...@gmail.com> wrote:
>>>>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>>>>>>>>> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:
>>>>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that
>>>>>>>>>>>> some functions called from there can throw exceptions.
>>>>>>>>>>> Yes, but all LWLocks should be released by normal error recovery.
>>>>>>>>>>> It should not be necessary for this code to clean that up by hand.
>>>>>>>>>>> If it were necessary, there would be TRY-CATCH around every single
>>>>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and
>>>>>>>>>>> unmaintainable system.  Please don't add a TRY-CATCH unless it's
>>>>>>>>>>> *necessary* -- and you haven't explained why this one is.
>>>>>>>>> Yes.
>>>>>>>> Thank you for the suggestion. I minunderstood that.
>>>>>>>>>> Putting hands into the code and at the problem, I can see that
>>>>>>>>>> dropping a subscription on a node makes it unresponsive in case of a
>>>>>>>>>> stop. And that's just because calls to LWLockRelease are missing as 
>>>>>>>>>> in
>>>>>>>>>> the patch attached. A try/catch problem should not be necessary.
>>>>>>>>> Thanks for the patch!
>>>>>>>>> With the patch, LogicalRepLauncherLock is released at the end of
>>>>>>>>> DropSubscription(). But ISTM that the lock should be released just 
>>>>>>>>> after
>>>>>>>>> logicalrep_worker_stop() and there is no need to protect the removal 
>>>>>>>>> of
>>>>>>>>> replication slot with the lock.
>>>>>>>> That's true. logicalrep_worker_stop returns after confirmig that
>>>>>>>> worker->proc is cleard, so no false relaunch cannot be caused.
>>>>>>>> After all, logicalrep_worker_stop is surrounded by
>>>>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton
>>>>>>>> and make the lock secrion to be more narrower.
>>>>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
>>>>>> removed and the comment for logicalrep_worker_stop() should be updated.
>>>>>> Your approach may cause the deadlock. The launcher takes 
>>>>>> LogicalRepWorkerLock
>>>>>> while holding LogicalRepLauncherLock. OTOH, with your approach,
>>>>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
>>>>>> LogicalRepWorkerLock.
>>>>>> Therefore I pushed the simple patch which adds LWLockRelease() just after
>>>>>> logicalrep_worker_stop().
>>>>>> Another problem that I found while reading the code is that the launcher 
>>>>>> can
>>>>>> start up the worker with the subscription that DROP SUBSCRIPTION just 
>>>>>> removed.
>>>>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
>>>>>> but the launcher can see it and start new worker until the transaction 
>>>>>> for
>>>>>> DROP has been committed.
>>>>> That was the reason why DropSubscription didn't release the lock in the
>>>>> first place. It was supposed to be released at the end of the
>>>>> transaction though.
>>>> OK, I understood why you used the lock in that way. But using LWLock
>>>> for that purpose is not valid.
>>> Yeah, I just tried to avoid what we are doing now really hard :)
>>>>>> To fix this issue, I think that DROP SUBSCRIPTION should take
>>>>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
>>>>>> so that the launcher cannot see the entry to be being removed.
>>>>> The whole point of having LogicalRepLauncherLock is to avoid having to
>>>>> do this, so if we do this we could probably get rid of it.
>>>> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
>>>> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
>>>> Attached patch does this.
>>> Okay, looks reasonable to me.
>> Thanks for the review!
>> But ISMT that I should suspend committing the patch until we fix the issue
>> that Sawada reported in other thread. That bugfix may change the related
>> code and design very much.
>> https://www.postgresql.org/message-id/cad21aod+vo93zz4zqtzqb-jz_wmko3oggdx1mxo4t+8q_zh...@mail.gmail.com
> That patch has been committed. And this issue still happens. Should we
> add this to the open item list so it doesn't get missed?

Thanks for ping. Pushed the patch.


Fujii Masao

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to