Hello, Hope you are doing well.
I've been playing a bit with the incremental backup feature which might come as part of the 17 release, and I think I hit a possible bug in the WAL summarizer process. The issue that I face refers to the summarizer process getting into a hung state. When the issue is triggered, it keeps in an infinite loop trying to process a WAL file that no longer exists. It apparently comes up only when I perform changes to `wal_summarize` GUC and reload Postgres, while there is some load in Postgres which makes it recycle WAL files. I'm running Postgres 17 in a Rockylinux 9 VM. In order to have less WAL files available in `pg_wal` and make it easier to reproduce the issue, I'm using a low value for `max_wal_size` ('100MB'). You can find below the steps that I took to reproduce this problem, assuming this small `max_wal_size`, and `summarize_wal` initially enabled: ```bash # Assume we initially have max_wal_size = '100MB' and summarize_wal = on # Create a table of ~ 100MB psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)" # Take a full backup pg_basebackup -X none -c fast -P -D full_backup_1 # Recreate a table of ~ 100MB psql -c "DROP TABLE test" psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)" # Take an incremental backup pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifest # Disable summarize_wal psql -c "ALTER SYSTEM SET summarize_wal TO off" psql -c "SELECT pg_reload_conf()" # Recreate a table of ~ 100MB psql -c "DROP TABLE test" psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)" # Re-enable sumarize_wal psql -c "ALTER SYSTEM SET summarize_wal TO on" psql -c "SELECT pg_reload_conf()" # Take a full backup pg_basebackup -X none -c fast -P -D full_backup_2 # Take an incremental backup pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifest ``` I'm able to reproduce the issue most of the time when running these steps manually. It's harder to reproduce if I attempt to run those commands as a bash script. This is the sample output of a run of those commands: ```console (barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_1NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup331785/331785 kB (100%), 1/1 tablespace(barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"DROP TABLE(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_1 --incremental full_backup_1/backup_manifestNOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup111263/331720 kB (33%), 1/1 tablespace(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO off"ALTER SYSTEM(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()" pg_reload_conf---------------- t(1 row) (barman) [postgres@barmandevhost ~]$ psql -c "DROP TABLE test"DROP TABLE(barman) [postgres@barmandevhost ~]$ psql -c "CREATE TABLE test AS SELECT generate_series(1, 3000000)"SELECT 3000000(barman) [postgres@barmandevhost ~]$ psql -c "ALTER SYSTEM SET summarize_wal TO on"ALTER SYSTEM(barman) [postgres@barmandevhost ~]$ psql -c "SELECT pg_reload_conf()" pg_reload_conf---------------- t(1 row) (barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D full_backup_2NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup331734/331734 kB (100%), 1/1 tablespace(barman) [postgres@barmandevhost ~]$ pg_basebackup -X none -c fast -P -D incremental_backup_2 --incremental full_backup_2/backup_manifestWARNING: still waiting for WAL summarization through 2/C1000028 after 10 secondsDETAIL: Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting for WAL summarization through 2/C1000028 after 20 secondsDETAIL: Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting for WAL summarization through 2/C1000028 after 30 secondsDETAIL: Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting for WAL summarization through 2/C1000028 after 40 secondsDETAIL: Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting for WAL summarization through 2/C1000028 after 50 secondsDETAIL: Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: still waiting for WAL summarization through 2/C1000028 after 60 secondsDETAIL: Summarization has reached 2/B30000D8 on disk and 2/B30000D8 in memory.WARNING: aborting backup due to backend exiting before pg_backup_stop was calledpg_basebackup: error: could not initiate base backup: ERROR: WAL summarization is not progressingDETAIL: Summarization is needed through 2/C1000028, but is stuck at 2/B30000D8 on disk and 2/B30000D8 in memory.pg_basebackup: removing data directory "incremental_backup_2" ``` I took an `ls` output from `pg_wal` as well as `strace` and `gdb` from the WAL summarizer process. I'm attaching that to this email hoping that can help somehow. FWIW, once I restart Postgres the WAL summarizer process gets back to normal functioning. It seems to me there is some race condition between when a WAL file is removed and when `summarize_wal` is re-enabled, causing the process to keep looking for a WAL file that is the past. Best regards, Israel.