On Sun, Mar 22, 2026 at 12:24 AM Tom Lane <[email protected]> wrote:
> I wrote: > > Unsurprisingly, applying this change to unmodified master results > > in the pg_waldump and pg_verifybackup tests falling over. More > > surprisingly, they still fall over after applying your fix to the > > decompressors, so there's some other source of garbage trailing > > data. I haven't figured out what. > > In the learn-something-new-every-day dept.: good ol' GNU tar itself > does that. By default, it zero-pads its output to a multiple of 10kB > after it's written the required terminator. Moreover, this behavior > is actually specified by POSIX: > > -x format > Specify the output archive format. The pax utility shall support > the following formats: > ... > ustar > The tar interchange format; see the EXTENDED DESCRIPTION > section. The default blocksize for this format for character > special archive files shall be 10240. Implementations shall > support all blocksize values less than or equal to 32256 that > are multiples of 512. > > So, astreamer_tar_parser_content's idea that it should disallow more > than 1024 bytes of trailer is completely wrong, which we would have > figured out long ago if the code attempting to enforce that weren't > completely broken. > > You could argue that this means the tar files our existing utilities > create aren't POSIX-compliant. I think it's all right though: we > can just say that we write these files with blocksize 1024 not > blocksize 10240, and tar-file readers are required to accept that > per the above spec text. > > However, this discourages me from editorializing on the file trailer > emitted by whatever wrote the tar file we are reading. I think > emitting it as-is is the most appropriate thing. So we should just > get rid of astreamer_tar_parser_content's nonfunctional error check > and not change its behavior otherwise. > > > OK, patch 5 of this set does that. I reworked your previous patches 2 and 3 slightly - mostly additional comments, and fixing a bug in use of sizeof(XLogLongPageHeader). Patch 4 here tries to fix the wrong use of cur_file in get_archive_wal_entry() cheers andrew
From 51d53b166df7c8eaebe49756e24088c16764807b Mon Sep 17 00:00:00 2001 From: Andrew Dunstan <[email protected]> Date: Sun, 22 Mar 2026 06:53:25 -0400 Subject: [PATCH v5 3/5] Fix init_archive_reader to not depend on cur_file. init_archive_reader() relied on privateInfo->cur_file to track which WAL segment was being read, but cur_file can become NULL if a member trailer is processed during a read_archive_file() call. This could cause unreproducible "could not find WAL in archive" failures, particularly with compressed archives where all the WAL data fits in a small number of compressed bytes. Fix by scanning the hash table after each read to find any cached WAL segment with sufficient data, instead of depending on cur_file. Also reduce the minimum data requirement from XLOG_BLCKSZ to sizeof(XLogLongPageHeaderData), since we only need the long page header to extract the segment size. Add a safety comment on cur_file in pg_waldump.h to document that it can change during a single read_archive_file() call. Author: Tom Lane <[email protected]> Discussion: https://postgr.es/m/[email protected] --- src/bin/pg_waldump/archive_waldump.c | 22 +++++++++++++++++----- src/bin/pg_waldump/pg_waldump.h | 9 ++++++++- 2 files changed, 25 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c index cd092a057ef..3fce2183099 100644 --- a/src/bin/pg_waldump/archive_waldump.c +++ b/src/bin/pg_waldump/archive_waldump.c @@ -173,17 +173,29 @@ init_archive_reader(XLogDumpPrivate *privateInfo, privateInfo->archive_wal_htab = ArchivedWAL_create(8, NULL); /* - * Read until we have at least one full WAL page (XLOG_BLCKSZ bytes) from - * the first WAL segment in the archive so we can extract the WAL segment - * size from the long page header. + * Read until we have at least one WAL segment with enough data to extract + * the WAL segment size from the long page header. + * + * We must not rely on cur_file here, because it can become NULL if a + * member trailer is processed during a read_archive_file() call. Instead, + * scan the hash table after each read to find any entry with sufficient + * data. */ - while (entry == NULL || entry->buf->len < XLOG_BLCKSZ) + while (entry == NULL) { + ArchivedWAL_iterator iter; + if (!read_archive_file(privateInfo, XLOG_BLCKSZ)) pg_fatal("could not find WAL in archive \"%s\"", privateInfo->archive_name); - entry = privateInfo->cur_file; + ArchivedWAL_start_iterate(privateInfo->archive_wal_htab, &iter); + while ((entry = ArchivedWAL_iterate(privateInfo->archive_wal_htab, + &iter)) != NULL) + { + if (entry->read_len >= sizeof(XLogLongPageHeaderData)) + break; + } } /* Extract the WAL segment size from the long page header */ diff --git a/src/bin/pg_waldump/pg_waldump.h b/src/bin/pg_waldump/pg_waldump.h index cde7c6ca3f2..ca0dfd97168 100644 --- a/src/bin/pg_waldump/pg_waldump.h +++ b/src/bin/pg_waldump/pg_waldump.h @@ -44,7 +44,14 @@ typedef struct XLogDumpPrivate Size archive_read_buf_size; #endif - /* What the archive streamer is currently reading */ + /* + * The buffer for the WAL file the archive streamer is currently reading, + * or NULL if none. It is quite risky to examine this anywhere except in + * astreamer_waldump_content(), since it can change multiple times during + * a single read_archive_file() call. However, it is safe to assume that + * if cur_file is different from a particular ArchivedWALFile of interest, + * then the archive streamer has finished reading that file. + */ struct ArchivedWALFile *cur_file; /* -- 2.43.0
From 9978359771c14b411704d808abc0f602119f0a9f Mon Sep 17 00:00:00 2001 From: Andrew Dunstan <[email protected]> Date: Sun, 22 Mar 2026 05:31:14 -0400 Subject: [PATCH v5 1/5] Fix finalization of decompressor astreamers. Send the correct amount of data to the next astreamer, not the whole allocated buffer size. It's unclear how we missed this bug; perhaps the use-cases so far are insensitive to trailing garbage. Author: Andrew Dunstan <[email protected]> Discussion: https://postgr.es/m/[email protected] --- src/fe_utils/astreamer_gzip.c | 9 +++++---- src/fe_utils/astreamer_lz4.c | 9 +++++---- src/fe_utils/astreamer_zstd.c | 2 +- 3 files changed, 11 insertions(+), 9 deletions(-) diff --git a/src/fe_utils/astreamer_gzip.c b/src/fe_utils/astreamer_gzip.c index 2e080c37a58..df392f67cab 100644 --- a/src/fe_utils/astreamer_gzip.c +++ b/src/fe_utils/astreamer_gzip.c @@ -347,10 +347,11 @@ astreamer_gzip_decompressor_finalize(astreamer *streamer) * End of the stream, if there is some pending data in output buffers then * we must forward it to next streamer. */ - astreamer_content(mystreamer->base.bbs_next, NULL, - mystreamer->base.bbs_buffer.data, - mystreamer->base.bbs_buffer.maxlen, - ASTREAMER_UNKNOWN); + if (mystreamer->bytes_written > 0) + astreamer_content(mystreamer->base.bbs_next, NULL, + mystreamer->base.bbs_buffer.data, + mystreamer->bytes_written, + ASTREAMER_UNKNOWN); astreamer_finalize(mystreamer->base.bbs_next); } diff --git a/src/fe_utils/astreamer_lz4.c b/src/fe_utils/astreamer_lz4.c index 2bc32b42879..605c188007b 100644 --- a/src/fe_utils/astreamer_lz4.c +++ b/src/fe_utils/astreamer_lz4.c @@ -397,10 +397,11 @@ astreamer_lz4_decompressor_finalize(astreamer *streamer) * End of the stream, if there is some pending data in output buffers then * we must forward it to next streamer. */ - astreamer_content(mystreamer->base.bbs_next, NULL, - mystreamer->base.bbs_buffer.data, - mystreamer->base.bbs_buffer.maxlen, - ASTREAMER_UNKNOWN); + if (mystreamer->bytes_written > 0) + astreamer_content(mystreamer->base.bbs_next, NULL, + mystreamer->base.bbs_buffer.data, + mystreamer->bytes_written, + ASTREAMER_UNKNOWN); astreamer_finalize(mystreamer->base.bbs_next); } diff --git a/src/fe_utils/astreamer_zstd.c b/src/fe_utils/astreamer_zstd.c index f26abcfd0fa..4b43ab795e3 100644 --- a/src/fe_utils/astreamer_zstd.c +++ b/src/fe_utils/astreamer_zstd.c @@ -347,7 +347,7 @@ astreamer_zstd_decompressor_finalize(astreamer *streamer) if (mystreamer->zstd_outBuf.pos > 0) astreamer_content(mystreamer->base.bbs_next, NULL, mystreamer->base.bbs_buffer.data, - mystreamer->base.bbs_buffer.maxlen, + mystreamer->zstd_outBuf.pos, ASTREAMER_UNKNOWN); astreamer_finalize(mystreamer->base.bbs_next); -- 2.43.0
From 91ed0e69a7df00be147548973a8172322998b528 Mon Sep 17 00:00:00 2001 From: Andrew Dunstan <[email protected]> Date: Sun, 22 Mar 2026 06:53:34 -0400 Subject: [PATCH v5 4/5] Fix get_archive_wal_entry to handle cur_file transitions reliably. As noted by Tom Lane, get_archive_wal_entry() uses cur_file in an unsafe way: a single read_archive_file() call can trigger multiple astreamer callbacks when compression is effective, causing cur_file to change several times (entry -> NULL -> new entry) within one call. The old code captured cur_file before the read and checked for changes after, but this missed intermediate transitions. This could cause spill-file handles to leak or data to not be flushed when the streamer finishes one segment and starts another within the same read. Restructure the spill logic to explicitly track the entry being spilled (spill_entry) separately from cur_file, and detect transitions at the top of each loop iteration. Also ensure spill file handles are closed on both success and error paths. Discussion: https://postgr.es/m/[email protected] --- src/bin/pg_waldump/archive_waldump.c | 117 ++++++++++++++++----------- 1 file changed, 69 insertions(+), 48 deletions(-) diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c index 3fce2183099..93ed856c674 100644 --- a/src/bin/pg_waldump/archive_waldump.c +++ b/src/bin/pg_waldump/archive_waldump.c @@ -463,11 +463,18 @@ free_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) * found. If the archive streamer is reading a WAL file from the archive that * is not currently needed, that data is spilled to a temporary file for later * retrieval. + * + * Because a single read_archive_file() call may trigger multiple astreamer + * callbacks (especially when data compresses well), cur_file can change + * several times within one call: from one entry to NULL (member trailer), + * and then to a new entry (next member header). The spill logic below + * handles this by flushing and closing per-entry state whenever we detect + * that the streamer has moved on. */ static ArchivedWALFile * get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) { - ArchivedWALFile *entry = NULL; + ArchivedWALFile *spill_entry = NULL; FILE *write_fp = NULL; /* @@ -477,6 +484,8 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) */ while (1) { + ArchivedWALFile *entry; + /* * Search hash table. * @@ -488,64 +497,76 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) entry = ArchivedWAL_lookup(privateInfo->archive_wal_htab, fname); if (entry != NULL) + { + /* Close any open spill file before returning. */ + if (write_fp != NULL) + fclose(write_fp); return entry; - - /* - * Capture the current entry before calling read_archive_file(), - * because cur_file may advance to a new segment during streaming. We - * hold this reference so we can flush any remaining buffer data and - * close the write handle once we detect that cur_file has moved on. - */ - entry = privateInfo->cur_file; - - /* - * Fetch more data either when no current file is being tracked or - * when its buffer has been fully flushed to the temporary file. - */ - if (entry == NULL || entry->buf->len == 0) - { - if (!read_archive_file(privateInfo, READ_CHUNK_SIZE)) - break; /* archive file ended */ } /* - * Archive streamer is reading a non-WAL file or an irrelevant WAL - * file. - */ - if (entry == NULL) - continue; - - /* - * The streamer is producing a WAL segment that isn't the one asked - * for; it must be arriving out of order. Spill its data to disk so - * it can be read back when needed. - */ - Assert(strcmp(fname, entry->fname) != 0); - - /* Create a temporary file if one does not already exist */ - if (!entry->spilled) - { - write_fp = prepare_tmp_write(entry->fname, privateInfo); - entry->spilled = true; - } - - /* Flush data from the buffer to the file */ - perform_tmp_write(entry->fname, entry->buf, write_fp); - resetStringInfo(entry->buf); - - /* - * If cur_file changed since we captured entry above, the archive - * streamer has finished this segment and moved on. Close its spill - * file handle so data is flushed to disk before the next segment - * starts writing to a different handle. + * If the streamer has moved on to a different entry than the one we + * were spilling, flush any remaining data for the old entry and close + * its spill file. */ - if (entry != privateInfo->cur_file && write_fp != NULL) + if (spill_entry != NULL && spill_entry != privateInfo->cur_file) { + if (spill_entry->buf->len > 0) + { + perform_tmp_write(spill_entry->fname, spill_entry->buf, + write_fp); + resetStringInfo(spill_entry->buf); + } fclose(write_fp); write_fp = NULL; + spill_entry = NULL; } + + /* + * If no WAL file is currently being streamed (cur_file is NULL), or + * the current spill entry's buffer has been fully flushed, we need + * more data from the archive. + */ + if (privateInfo->cur_file == NULL || + (spill_entry != NULL && spill_entry->buf->len == 0)) + { + if (!read_archive_file(privateInfo, READ_CHUNK_SIZE)) + break; /* archive fully exhausted */ + continue; /* re-check hash table and cur_file */ + } + + /* + * cur_file points to a WAL segment that isn't the one asked for; it + * must be arriving out of order. Spill its data to disk so it can be + * read back when needed. + */ + spill_entry = privateInfo->cur_file; + Assert(strcmp(fname, spill_entry->fname) != 0); + + /* Create a temporary file if one does not already exist */ + if (!spill_entry->spilled) + { + write_fp = prepare_tmp_write(spill_entry->fname, privateInfo); + spill_entry->spilled = true; + } + + /* Flush data from the buffer to the file */ + perform_tmp_write(spill_entry->fname, spill_entry->buf, write_fp); + resetStringInfo(spill_entry->buf); + + /* + * Read more data from the archive. This may add data to the current + * spill_entry's buffer, advance cur_file to a new entry, or set + * cur_file to NULL (member trailer). + */ + if (!read_archive_file(privateInfo, READ_CHUNK_SIZE)) + break; /* archive fully exhausted */ } + /* Close any open spill file before erroring out. */ + if (write_fp != NULL) + fclose(write_fp); + /* Requested WAL segment not found */ pg_fatal("could not find WAL \"%s\" in archive \"%s\"", fname, privateInfo->archive_name); -- 2.43.0
From 0de5fd971602ae00fd3bd62cf5da0d8f0a3cce5a Mon Sep 17 00:00:00 2001 From: Andrew Dunstan <[email protected]> Date: Sun, 22 Mar 2026 05:32:45 -0400 Subject: [PATCH v5 2/5] Fix failure to finalize the decompression pipeline at archive EOF. archive_waldump.c called astreamer_finalize() nowhere. This meant that any data retained in decompression buffers at the moment we detect archive EOF would never reach astreamer_waldump_content(), resulting in surprising failures if we actually need the last few bytes of the archive file. To fix, make read_archive_file() do the finalize once it detects EOF. Change its API to return a boolean "yes there's more data" rather than the entirely-misleading raw count of bytes read. Also document the contract that cur_file can change (or become NULL) during a single read_archive_file() call, since the decompression pipeline may produce enough output to trigger multiple astreamer callbacks. Author: Tom Lane <[email protected]> Discussion: https://postgr.es/m/[email protected] --- src/bin/pg_waldump/archive_waldump.c | 50 +++++++++++++++++++++++----- src/bin/pg_waldump/pg_waldump.h | 1 + 2 files changed, 42 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_waldump/archive_waldump.c b/src/bin/pg_waldump/archive_waldump.c index b078c2d6960..cd092a057ef 100644 --- a/src/bin/pg_waldump/archive_waldump.c +++ b/src/bin/pg_waldump/archive_waldump.c @@ -89,7 +89,7 @@ typedef struct astreamer_waldump static ArchivedWALFile *get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo); -static int read_archive_file(XLogDumpPrivate *privateInfo, Size count); +static bool read_archive_file(XLogDumpPrivate *privateInfo, Size count); static void setup_tmpwal_dir(const char *waldir); static void cleanup_tmpwal_dir_atexit(void); @@ -139,6 +139,7 @@ init_archive_reader(XLogDumpPrivate *privateInfo, pg_fatal("could not open file \"%s\"", privateInfo->archive_name); privateInfo->archive_fd = fd; + privateInfo->archive_fd_eof = false; streamer = astreamer_waldump_new(privateInfo); @@ -178,7 +179,7 @@ init_archive_reader(XLogDumpPrivate *privateInfo, */ while (entry == NULL || entry->buf->len < XLOG_BLCKSZ) { - if (read_archive_file(privateInfo, XLOG_BLCKSZ) == 0) + if (!read_archive_file(privateInfo, XLOG_BLCKSZ)) pg_fatal("could not find WAL in archive \"%s\"", privateInfo->archive_name); @@ -236,9 +237,10 @@ free_archive_reader(XLogDumpPrivate *privateInfo) /* * NB: Normally, astreamer_finalize() is called before astreamer_free() to * flush any remaining buffered data or to ensure the end of the tar - * archive is reached. However, when decoding WAL, once we hit the end - * LSN, any remaining buffered data or unread portion of the archive can - * be safely ignored. + * archive is reached. read_archive_file() may have done so. However, + * when decoding WAL we can stop once we hit the end LSN, so we may never + * have read all of the input file. In that case any remaining buffered + * data or unread portion of the archive can be safely ignored. */ astreamer_free(privateInfo->archive_streamer); @@ -384,7 +386,7 @@ read_archive_wal_page(XLogDumpPrivate *privateInfo, XLogRecPtr targetPagePtr, fname, privateInfo->archive_name, (long long int) (count - nbytes), (long long int) count); - if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0) + if (!read_archive_file(privateInfo, READ_CHUNK_SIZE)) pg_fatal("unexpected end of archive \"%s\" while reading \"%s\": read %lld of %lld bytes", privateInfo->archive_name, fname, (long long int) (count - nbytes), @@ -490,7 +492,7 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) */ if (entry == NULL || entry->buf->len == 0) { - if (read_archive_file(privateInfo, READ_CHUNK_SIZE) == 0) + if (!read_archive_file(privateInfo, READ_CHUNK_SIZE)) break; /* archive file ended */ } @@ -540,8 +542,22 @@ get_archive_wal_entry(const char *fname, XLogDumpPrivate *privateInfo) /* * Reads a chunk from the archive file and passes it through the streamer * pipeline for decompression (if needed) and tar member extraction. + * + * count is the maximum amount to try to read this time. Note that it's + * measured in raw file bytes, and may have little to do with how much + * comes out of decompression/extraction. + * + * Returns true if successful, false if there is no more data. + * + * Callers must be aware that a single call may trigger multiple callbacks + * in astreamer_waldump_content, so privateInfo->cur_file can change value + * (or become NULL) during a call. In particular, cur_file is set to NULL + * when the ASTREAMER_MEMBER_TRAILER callback fires at the end of a tar + * member; it is then set to a new entry when the next WAL member's + * ASTREAMER_MEMBER_HEADER callback fires, which may or may not happen + * within the same call. */ -static int +static bool read_archive_file(XLogDumpPrivate *privateInfo, Size count) { int rc; @@ -549,6 +565,11 @@ read_archive_file(XLogDumpPrivate *privateInfo, Size count) /* The read request must not exceed the allocated buffer size. */ Assert(privateInfo->archive_read_buf_size >= count); + /* Fail if we already reached EOF in a prior call. */ + if (privateInfo->archive_fd_eof) + return false; + + /* Try to read some more data. */ rc = read(privateInfo->archive_fd, privateInfo->archive_read_buf, count); if (rc < 0) pg_fatal("could not read file \"%s\": %m", @@ -562,8 +583,19 @@ read_archive_file(XLogDumpPrivate *privateInfo, Size count) astreamer_content(privateInfo->archive_streamer, NULL, privateInfo->archive_read_buf, rc, ASTREAMER_UNKNOWN); + else + { + /* + * We reached EOF, but there is probably still data queued in the + * astreamer pipeline's buffers. Flush it out to ensure that we + * process everything. + */ + astreamer_finalize(privateInfo->archive_streamer); + /* Set flag to ensure we don't finalize more than once. */ + privateInfo->archive_fd_eof = true; + } - return rc; + return true; } /* diff --git a/src/bin/pg_waldump/pg_waldump.h b/src/bin/pg_waldump/pg_waldump.h index 36893624f53..cde7c6ca3f2 100644 --- a/src/bin/pg_waldump/pg_waldump.h +++ b/src/bin/pg_waldump/pg_waldump.h @@ -35,6 +35,7 @@ typedef struct XLogDumpPrivate char *archive_dir; char *archive_name; /* Tar archive filename */ int archive_fd; /* File descriptor for the open tar file */ + bool archive_fd_eof; /* Have we reached EOF on archive_fd? */ astreamer *archive_streamer; char *archive_read_buf; /* Reusable read buffer for archive I/O */ -- 2.43.0
From cab3e116c1064afcc16e638555c74f7f460be55b Mon Sep 17 00:00:00 2001 From: Andrew Dunstan <[email protected]> Date: Sun, 22 Mar 2026 06:53:58 -0400 Subject: [PATCH v5 5/5] Remove nonfunctional tar file trailer size check. The ASTREAMER_ARCHIVE_TRAILER case in astreamer_tar_parser_content() intended to reject tar files whose trailer exceeded 2 blocks. However, the check compared 'len' after astreamer_buffer_bytes() had already consumed all the data and set len to 0, so the pg_fatal() could never fire. Moreover, per the POSIX specification for the ustar format, the last physical block of a tar archive is always full-sized, and "logical records after the two zero logical records may contain undefined data." GNU tar, for example, zero-pads its output to a 10kB boundary by default. So rejecting extra data after the two zero blocks would be wrong even if the check worked. Remove the dead check and update the comment to explain why trailing data is expected and harmless. Per report from Tom Lane. Discussion: https://postgr.es/m/[email protected] --- src/fe_utils/astreamer_tar.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/src/fe_utils/astreamer_tar.c b/src/fe_utils/astreamer_tar.c index 4b187f0a8c4..f8be5e4ff8a 100644 --- a/src/fe_utils/astreamer_tar.c +++ b/src/fe_utils/astreamer_tar.c @@ -236,12 +236,16 @@ astreamer_tar_parser_content(astreamer *streamer, astreamer_member *member, /* * We've seen an end-of-archive indicator, so anything more is - * buffered and sent as part of the archive trailer. But we - * don't expect more than 2 blocks. + * buffered and sent as part of the archive trailer. + * + * Per POSIX, the last physical block of a tar archive is + * always full-sized, so there may be undefined data after the + * two zero blocks that mark end-of-archive. GNU tar, for + * example, zero-pads to a 10kB boundary by default. We just + * buffer whatever we receive and pass it along at finalize + * time. */ astreamer_buffer_bytes(streamer, &data, &len, len); - if (len > 2 * TAR_BLOCK_SIZE) - pg_fatal("tar file trailer exceeds 2 blocks"); return; default: -- 2.43.0
