RE: Exit walsender before confirming remote flush in logical replication

Hayato Kuroda (Fujitsu) Thu, 26 Jan 2023 20:12:20 -0800

Dear Dilip, hackers,

Thanks for giving your opinion. I analyzed the relation with the given commit,
and I thought I could keep my patch. How do you think?


# Abstract

* Some modifications should be needed.
* We cannot rollback the shutdown if walsenders are stuck
* We don't have a good way to detect stuck

# Discussion

Compared to physical replication, it is likely to happen that logical 
replication is stuck.
I think the risk should be avoided as much as possible by fixing codes.
Then, if it leads another failure, we can document the caution to users.

While shutting down the server, checkpointer sends SIGUSR1 signal to wansenders.
It is done after being exited other processes, so we cannot raise ERROR and 
rollback
the operation if checkpointer recognize the process stuck at that time.

We don't have any features that postmaster can check  whether this node is
publisher or not. So if we want to add the mechanism that can check the health
of walsenders before shutting down, we must do that at the top of
process_pm_shutdown_request() even if we are not in logical replication.
I think it affects the basis of postgres largely, and in the first place,
PostgreSQL does not have a mechanism to check the health of process.

Therefore, I want to adopt the approach that walsender itself exits immediately 
when they get signals.

## About patch - Were fixes correct?

In ProcessPendingWrites(), my patch, wansender calls WalSndDone() when it gets
SIGUSR1 signal. I think this should be. From the patch [1]:

```
@@ -1450,6 +1450,10 @@ ProcessPendingWrites(void)
                /* Try to flush pending output to the client */
                if (pq_flush_if_writable() != 0)
                        WalSndShutdown();
+
+               /* If we got shut down requested, try to exit the process */
+               if (got_STOPPING)
+                       WalSndDone(XLogSendLogical);
        }
 
        /* reactivate latch so WalSndLoop knows to continue */
```


Per my analysis, in case of logical replication, walsenders exit with following
steps. Note that logical walsender does not receive SIGUSR2 signal, set flag by
themselves instead:

1. postmaster sends shutdown requests to checkpointer
2. checkpointer sends SIGUSR1 to walsenders and wait
3. when walsenders accept SIGUSR1, they turn got_SIGUSR1 on.
4. walsenders consume all WALs. @XLogSendLogical
5. walsenders turn got_SIGUSR2 on by themselves @XLogSendLogical
6. walsenders recognize the flag is on, so call WalSndDone() @ WalSndLoop
7. proc_exit(0)
8. checkpoitner writes shutdown record
...

Type (i) stuck, I reported in -hackers[1], means that processes stop at step 6
and Type (ii) stuck means that processes stop at 4. In step4, got_SIGUSR2 is 
never set to on, so
we must use got_STOPPING flag.

[1]: 
https://www.postgresql.org/message-id/tycpr01mb58701a47f35fed0a2b399662f5...@tycpr01mb5870.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

RE: Exit walsender before confirming remote flush in logical replication

Reply via email to