very long secondary->primary switch time

Tomas Pospisek Tue, 27 Apr 2021 10:15:32 -0700

Hello all,

I maintain a PostgreSQL cluster that does failover via Patroni. The problem is that after a failover happens, it takes the secondary too long (about 35 minutes) to come up and answer queries. The log of the secondary looks like this:



04:00:29.777 [9679] LOG:  received promote request
04:00:29.780 [9693] FATAL:  terminating walreceiver process due to administrator command
04:00:29.780 [9679] LOG:  invalid record length at 320/B95A1EE0: wanted 24, got 0
04:00:29.783 [9679] LOG:  redo done at 320/B95A1EA8
04:00:29.783 [9679] LOG:  last completed transaction was at log time 2021-03-03 03:57:46.466342+01


04:35:00.982 [9679] LOG:  selected new timeline ID: 15
04:35:01.404 [9679] LOG:  archive recovery complete
04:35:02.337 [9662] LOG:  database system is ready to accept connections

The cluster is "fairly large" with thousands of DBs (sic!) and ~1TB of data.

I would like to shorten the failover/startup time drastically. Why does it take the secondary that much time to switch to the primary state? There are no log entries between 04:00 and 04:35. What is PostgreSQL doing during those 35 minutes?
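To put a number on the silent period, the gap can be computed from the timestamps in the log above. This is a minimal Python sketch, assuming the time-only timestamp format shown and that all lines fall on the same day; the helper names `parse_time` and `largest_gap` are made up for illustration and are not part of PostgreSQL or Patroni:

```python
from datetime import datetime, timedelta

# A few lines copied from the secondary's log above.
LOG_LINES = [
    "04:00:29.777 [9679] LOG:  received promote request",
    "04:00:29.783 [9679] LOG:  redo done at 320/B95A1EA8",
    "04:35:00.982 [9679] LOG:  selected new timeline ID: 15",
    "04:35:02.337 [9662] LOG:  database system is ready to accept connections",
]


def parse_time(line):
    """Parse the leading HH:MM:SS.mmm field (assumes no date and no midnight rollover)."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause between consecutive lines."""
    times = [parse_time(line) for line in lines]
    gaps = [
        (later - earlier, index)
        for index, (earlier, later) in enumerate(zip(times, times[1:]))
    ]
    gap, index = max(gaps)
    return gap, lines[index], lines[index + 1]


gap, before, after = largest_gap(LOG_LINES)
print("longest silent period:", gap)
print("  after: ", before)
print("  before:", after)
```

For this log, the longest pause lies between "redo done" and "selected new timeline ID", roughly 34.5 minutes, so whatever takes the time happens after WAL replay has finished and before the new timeline is selected.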

I am *guessing* that PostgreSQL *might* be doing some consistency check or replaying the WAL (max_wal_size: 16 GB, wal_keep_segments: 100). I am also *guessing* that startup time *might* have to do with the size of the data (~1 TB) and/or with the number of DBs (thousands). If that is the case, then splitting the cluster into multiple clusters should allow for faster startup times?

I have tried to duckduck why the secondary takes that much time to switch to primary mode, but have failed to find information that would enlighten me. So any pointers to information, hints or help are very welcome.


Thanks & greets,
*t
