Michael,

Thanks so much for the quick response.
> On 15 Dec 2021, at 00:37, Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
>> The reindex went fine in the primary database and in one of our
>> standby. The other standby that we also operate for some reason
>> ended up in a state where all transactions were locked by the WAL
>> process and the WAL process was not able to make any progress. In
>> order to solve this issue we had to move traffic from the “bad”
>> standby to the healthy one and then kill all transactions that were
>> running in the “bad” standby. After that, replication was able to
>> resume successfully.
>
> You are referring to the startup process that replays WAL, right?

That is correct, I’m talking about the startup process that replays the WAL files.

> Without having an idea about the type of workload your primary and/or
> standbys are facing, as well as an idea of the configuration you are
> using on both (hot_standby_feedback for one), I have no direct idea,

The primary handles IoT data ingestion. The table that we had to REINDEX gets updated every time a new message arrives in the system, so updates happen very often on that table, hence the index/table bloat.

At any point in time the standby would be receiving queries that take advantage of the indexes that were being reindexed.

hot_standby_feedback is currently turned off on the standbys.

> but that could be a conflict caused by a concurrent vacuum.
>
> Seeing where things got stuck could also be useful, perhaps with a
> backtrace of the area where it happens and some information around
> it.
>
>> I’m just trying to understand what could have caused this issue. I
>> was not able to identify any queries in the standby that would be
>> locking the WAL process. Any insight would be more than welcome!
>
> That's not going to be easy without more information, I am afraid.
> --
> Michael
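
For what it's worth, if this happens again I'll try to capture something like the following from the standby before killing any sessions. It is only a rough sketch based on the documented views (backend_type needs 10 or newer, and pg_blocking_pids() only reports heavyweight-lock waits, so it may come back empty even when the startup process is stuck):

-- What is the startup process waiting on, and which backends
-- (if any) hold a lock that blocks it?
SELECT s.pid,
       s.wait_event_type,
       s.wait_event,
       pg_blocking_pids(s.pid) AS blocked_by
FROM pg_stat_activity s
WHERE s.backend_type = 'startup';

-- Cumulative recovery-conflict counters on the standby.
SELECT * FROM pg_stat_database_conflicts;

If the first query turns up nothing, the conflict counters should at least narrow down whether it was a lock, snapshot, or buffer-pin conflict.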