Re: Exit walsender before confirming remote flush in logical replication

Andres Freund Sun, 05 Feb 2023 21:03:31 -0800

Hi, 

On February 5, 2023 8:29:19 PM PST, Amit Kapila <amit.kapil...@gmail.com> wrote:
>On Sat, Feb 4, 2023 at 6:31 PM Andres Freund <and...@anarazel.de> wrote:
>>
>> On 2023-02-02 11:21:54 +0530, Amit Kapila wrote:
>> > The main problem we want to solve here is to avoid shutdown failing in
>> > case walreceiver/applyworker is busy waiting for some lock or for some
>> > other reason as shown in the email [1].
>>
>> Isn't handling this part of the job of wal_sender_timeout?
>>
>
>In some cases, it is not clear whether we can handle it by
>wal_sender_timeout. Consider a case of a time-delayed replica where
>the applyworker will keep sending some response/alive message so that
>walsender doesn't timeout in that (during delay) period. In that case,
>because walsender won't timeout, the shutdown will fail (with the
>failed message) even though it will be complete after the walsender is
>able to send all the WAL and shutdown. The time-delayed replica patch
>is still under discussion [1]. Also, for large values of
>wal_sender_timeout, it will wait till the walsender times out and can
>return with a failed message.
>
>>
>> I don't at all agree that it's ok to just stop replicating changes
>> because we're blocked on network IO. The patch justifies this with:
>>
>> > Currently, at shutdown, walsender processes wait to send all pending data 
>> > and
>> > ensure the all data is flushed in remote node. This mechanism was added by
>> > 985bd7 for supporting clean switch over, but such use-case cannot be 
>> > supported
>> > for logical replication. This commit remove the blocking in the case.
>>
>> and at the start of the thread with:
>>
>> > In case of logical replication, however, we cannot support the use-case 
>> > that
>> > switches the role publisher <-> subscriber. Suppose same case as above, 
>> > additional
>> > transactions are committed while doing step2. To catch up such changes 
>> > subscriber
>> > must receive WALs related with trans, but it cannot be done because 
>> > subscriber
>> > cannot request WALs from the specific position. In the case, we must 
>> > truncate all
>> > data in new subscriber once, and then create new subscription with 
>> > copy_data
>> > = true.
>>
>> But that seems a too narrow view to me. Imagine you want to decomission
>> the current primary, and instead start to use the logical standby as the
>> primary. For that you'd obviously want to replicate the last few
>> changes. But with the proposed change, that'd be hard to ever achieve.
>>
>
>I think that can still be achieved with the idea being discussed which
>is to keep allowing sending the WAL for smart shutdown mode but not
>for other modes(fast, immediate). I don't know whether it is a good
>idea or not but Kuroda-San has produced a POC patch for it. We can
>instead choose to improve our docs related to shutdown to explain a
>bit more about the shutdown's interaction with (logical and physical)
>replication. As of now, it says: (“Smart” mode disallows new
>connections, then waits for all existing clients to disconnect. If the
>server is in hot standby, recovery and streaming replication will be
>terminated once all clients have disconnected.)[2]. Here, it is not
>clear that shutdown will wait for sending and flushing all the WALs.
>The information for fast and immediate modes is even lesser which
>makes it more difficult to understand what kind of behavior is
>expected in those modes.
>
>[1] - https://commitfest.postgresql.org/42/3581/
>[2] - https://www.postgresql.org/docs/devel/app-pg-ctl.html
>


Smart shutdown is practically unusable. I don't think it makes sense to tie 
behavior of walsender to it in any way. 


Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: Exit walsender before confirming remote flush in logical replication

Reply via email to