Hello hackers, While investigating the recent skink failure [1], I've reproduced this failure under Valgrind on a slow machine and found that this happens due to the last checkpoint recorded in the segment 2, that is removed in the test: The failure log contains: 2023-10-10 19:10:08.212 UTC [2144251][startup][:0] LOG: invalid checkpoint record 2023-10-10 19:10:08.214 UTC [2144251][startup][:0] PANIC: could not locate a valid checkpoint record
The line above: [19:10:02.701](318.076s) ok 1 - 000000010000000000000001 differs from 000000010000000000000002 tells us about the duration of previous operations (> 5 mins). src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log: 2023-10-10 19:04:50.149 UTC [1845798][postmaster][:0] LOG: database system is ready to accept connections ... 2023-10-10 19:09:49.131 UTC [1847585][checkpointer][:0] LOG: checkpoint starting: time ... 2023-10-10 19:10:02.058 UTC [1847585][checkpointer][:0] LOG: checkpoint complete: ... lsn=0/*2093980*, redo lsn=0/1F62760 And here is one more instance of this failure [2]: 2022-11-08 02:35:25.826 UTC [1614205][][:0] PANIC: could not locate a valid checkpoint record 2022-11-08 02:35:26.164 UTC [1612967][][:0] LOG: startup process (PID 1614205) was terminated by signal 6: Aborted src/test/recovery/tmp_check/log/026_overwrite_contrecord_primary.log: 2022-11-08 02:29:57.961 UTC [1546469][][:0] LOG: database system is ready to accept connections ... 2022-11-08 02:35:10.764 UTC [1611737][][2/10:0] LOG: statement: SELECT pg_walfile_name(pg_current_wal_insert_lsn()) 2022-11-08 02:35:11.598 UTC [1546469][][:0] LOG: received immediate shutdown request The next successful run after the failure [1] shows the following duration: [21:34:48.556](180.150s) ok 1 - 000000010000000000000001 differs from 000000010000000000000002 And the last successful run: [03:03:53.892](126.206s) ok 1 - 000000010000000000000001 differs from 000000010000000000000002 So to fail on the test, skink should perform at least twice slower than usual, and may be it's an extraordinary condition indeed, but on the other hand, may be increase checkpoint_timeout as already done in several tests (015_promotion_pages, 038_save_logical_slots_shutdown, 039_end_of_wal, ...). [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2023-10-10%2017%3A10%3A11 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-11-07%2020%3A27%3A11 Best regards, Alexander