On Wed, Mar 25, 2026 at 1:51 AM Nisha Moond <[email protected]> wrote:
> Thank you, Fujii-san, for sharing the steps. I am now able to
> reproduce the behavior where promotion gets stuck because the slot
> sync worker remains in a wait loop.

Thanks for the test!

> As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
> (using slightly higher values for debugging). With this setting, the
> TCP stack terminates the connection if data sent to the primary
> remains unacknowledged beyond the configured timeout (e.g., due to a
> network drop). In such cases the slot sync worker exits instead of
> waiting indefinitely. With an appropriately tuned timeout, this could
> help avoid the promotion issue by ensuring the worker does not remain
> stuck when the connection to the primary is lost.

Yes, TCP timeout settings like tcp_user_timeout, keepalives, and
net.ipv4.tcp_retries2 can help in this situation. However, they involve
a trade-off: using very small timeouts can reduce failover time but
increases the risk of false network failure detection, while larger
timeouts (e.g., 10s) avoid false positives but can delay failover by
that amount. Because of this, I think it's better to address the issue
without relying on such TCP timeout parameters.

Also, tcp_user_timeout is not available on platforms that don't support
TCP_USER_TIMEOUT (e.g., Windows).

Regards,

-- 
Fujii Masao
