Hi Melanie,

That all makes sense — your plan sounds right to me. Happy to go with
relaxing pgstat_tracks_io_op() for 19 and leaving the
table_beginscan_catalog() flags work for 20, and to drop my patch.

Reproducer attached.

Thanks so much for the quick and thoughtful response — really appreciate it!

Ewan

On Mon, Jun 1, 2026 at 11:05 PM Melanie Plageman <[email protected]>
wrote:

> On Sun, May 31, 2026 at 12:37 AM Ewan Young <[email protected]> wrote:
> >
> > I was stress-testing master (commit e2b35735b00, assertions enabled)
> with a
> > workload that does a lot of DDL/DML, including creating and dropping
> > databases in a tight loop, and the autovacuum launcher kept crashing on
> me --
> > every 15-40 minutes or so once it was under load:
> >
> >   TRAP: failed Assert("pgstat_tracks_io_op(MyBackendType, io_object,
> >         io_context, io_op)"), File: "pgstat_io.c", Line: 74
> >   LOG:  autovacuum launcher process (PID ...) was terminated by signal 6:
> >         Aborted
> >
> > The postmaster recovers fine, but it just starts another launcher that
> hits
> > the exact same assert, so it never really gets out of the loop.
>
> Ouch :( Thanks for the report!
>
> > What surprised me is that the launcher's catalog scan isn't even flagged
> > read-only (table_beginscan_catalog doesn't set SO_HINT_REL_READ_ONLY),
> > so it never actually intends to set the VM -- it just pins/extends it
> anyway.
>
> Yea, so SO_HINT_REL_READ_ONLY is only meant as a hint. I don't
> guarantee that all queries that aren't modifying the relation will
> pass it. It's only a performance optimization. We did touch on
> excluding scans of catalog tables briefly in the thread (albeit deep
> in the thread) [1].
>
> That being said, we still pin the VM (and potentially extend it) even
> if we don't set it. If it does already exist, pinning it lets us check
> for corruption. If we extend it and won't set it, it isn't totally
> wasted work because then we don't have to extend it later. Though it's
> true that if the VM needs to be extended, that part of the VM can't be
> corrupted, I wanted to avoid special casing the presence of the VM
> page.
>
> When I wrote pgstat_tracks_io_op(), I thought through which IO
> operations we should expect for each backend type. The idea was that
> if that were ever to change, we could change pgstat_tracks_io_op().
> There is no reason why the autovacuum launcher inherently shouldn't
> extend the VM. Logically, if it is just reading catalog tables, it
> won't need to extend the actual data fork of the relation. However,
> now that reading tables may cause extending the VM, we can modify
> pgstat_tracks_io_op() like this:
>
> -   if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
> -        bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
> +   if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
> +       io_op == IOOP_EXTEND)
>         return false;
>
> We should probably add a flags argument to table_beginscan_catalog()
> in 20. That along with modifying pgstat_tracks_io_op() is probably the
> right solution IMO. I think we shouldn't do that in 19 since it is
> expanding the feature footprint.
>
> So for fixing 19, we could just alter pgstat_tracks_io_op() which
> would avoid tripping the assert.
>
> The next question is whether or not we should also do the patch you
> proposed, since it is true that if the VM is not extended enough to
> have the relevant VM bits, we can't possibly be fixing corruption. And
> if we are not doing that and won't set the VM, we are prematurely
> doing work.
>
> My hesitation about this is that the logic is a bit confusing. It
> relies on us knowing that visibiltymap_pin() will extend the relation
> and visibilitymap_get_status() won't. Both will read and pin the VM if
> it exists. If in heap_page_prune_and_freeze() we do anything with the
> VM page, it must be pinned. So, you have to know that if it needed to
> be extended, we won't have done that if rel_read_only was passed and
> so vmbuffer will be invalid and visiblitymap_get_status() will have
> returned 0. I don't think it's wrong, it just feels a bit off in a way
> I can't quite put my finger on. Perhaps it is just that I liked the
> invariant that the VM page would always be passed to
> heap_page_prune_and_freeze(). I'll have to think about it a bit more.
>
> On Mon, Jun 1, 2026 at 4:41 AM Ewan Young <[email protected]> wrote:
> >
> > Two commits combine to cause it: 4f7ecca84dd added the unconditional
> > (extending) visibilitymap_pin() in the on-access prune path, and
> 378a216187a
> > made INSERT set pd_prune_xid, so on-access pruning now fires on
> insert-mostly
> > catalogs like pg_database.  That's also why it was hard to reduce: any
> > regular backend or autovacuum worker that scans pg_database recreates the
> > fork harmlessly (they may extend), so the launcher only crashes in the
> brief
> > window before that happens.  I can now reproduce it deterministically;
> happy
> > to share the script.
>
> Please do share your reproducer. It's always nice to have those.
>
> - Melanie
>
> [1]
> https://www.postgresql.org/message-id/9468F957-C0ED-4D72-8C89-61162CAA5591%40yandex-team.ru
>
#!/bin/bash
# Deterministic reproducer: autovacuum launcher TRAP in pgstat_tracks_io_op()
# (pgstat_io.c:74) caused by an IOOP_EXTEND of the visibility map.
#
# Mechanism (commits 4f7ecca84dd + 378a216187a on HEAD):
#   - The autovacuum launcher's get_database_list() seqscans pg_database without
#     SO_HINT_REL_READ_ONLY. If a page is FULL and its pd_prune_xid is removable
#     and the _vm fork is missing, pruning extends the VM -> IOOP_EXTEND, which
#     pgstat_tracks_io_op() asserts the launcher never does -> TRAP.
#
# The trigger is a race: the launcher must be the FIRST thing to scan a full,
# unpruned pg_database page while _vm is absent. Any other backend that scans
# first prunes the page (clearing pd_prune_xid) and recreates _vm harmlessly,
# closing the window. We make it deterministic by:
#   1. autovacuum OFF during the fill, plus a held REPEATABLE READ snapshot that
#      pins the xmin horizon so NOTHING is removable -> nothing prunes, so the
#      full pages keep their pd_prune_xid.
#   2. stop (drops the holder), delete the _vm fork to simulate vm extend.
#   3. restart with autovacuum ON + huge thresholds (so no worker is scheduled):
#      the launcher's get_database_list scan is the first/only scanner of
#      pg_database and trips the assert.
#
# Override the install/data locations if your layout differs:
#   PGBIN=/path/to/pg/bin PGDATA=/path/to/datadir ./av_launcher_vm_extend_repro.sh
set -u
BIN="${PGBIN:-$HOME/postgres-install/bin}"
PGDATA="${PGDATA:-/tmp/avrepro-pgdata}"
PORT="${PGPORT:-5440}"
LOG=/tmp/avrepro-pg.log
FP="global/1262"   # pg_database (mapped) main fork

crashed() { grep -qE "pgstat_tracks_io_op|autovacuum launcher process.*was terminated by signal" "$LOG" 2>/dev/null; }

$BIN/pg_ctl -D "$PGDATA" -m immediate stop >/dev/null 2>&1
rm -rf "$PGDATA" "$LOG"
$BIN/initdb -D "$PGDATA" -U postgres --auth=trust >/tmp/avrepro-initdb.log 2>&1 || { echo initdb failed; exit 1; }
cat >> "$PGDATA/postgresql.conf" <<EOF
fsync = off
port = $PORT
listen_addresses = 'localhost'
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 2000000000
autovacuum_vacuum_insert_threshold = 2000000000
autovacuum_freeze_max_age = 2000000000
EOF

# --- Phase 1: fill pg_database with autovacuum OFF and a held snapshot ---------
$BIN/pg_ctl -D "$PGDATA" -l "$LOG" -w -o "-c autovacuum=off" start >/dev/null 2>&1 \
  || { echo start failed; tail "$LOG"; exit 1; }

# Hold a REPEATABLE READ snapshot open to pin the xmin horizon for the whole fill
# so no insert's pd_prune_xid is ever "removable" -> nothing prunes pg_database.
$BIN/psql -U postgres -p $PORT -d postgres >/tmp/avrepro-holder.log 2>&1 <<'SQL' &
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_current_xact_id();
SELECT pg_sleep(600);
COMMIT;
SQL
HOLDER_PID=$!
sleep 1   # let the holder establish its snapshot before we start inserting

echo "[repro] filling pg_database with 220 databases (autovacuum off, snapshot held)..."
for i in $(seq 1 220); do echo "CREATE DATABASE reprodb_$i;"; done | $BIN/psql -U postgres -p $PORT -d postgres -q >/dev/null 2>&1
echo "[repro] pg_database size: $($BIN/psql -U postgres -p $PORT -d postgres -tAq -c "SELECT pg_relation_size('pg_database')") bytes"

kill "$HOLDER_PID" 2>/dev/null

# --- Phase 2: open the window, restart with autovacuum ON ----------------------
$BIN/pg_ctl -D "$PGDATA" -m fast stop >/dev/null 2>&1
rm -f "$PGDATA/${FP}_vm"            # delete VM fork -> next prune must extend it
: > "$LOG"
$BIN/pg_ctl -D "$PGDATA" -l "$LOG" -w start >/dev/null 2>&1

# The launcher scans pg_database within ~naptime; do NOT connect any client.
for s in $(seq 1 20); do
  crashed && break
  sleep 1
done
echo "[repro] launcher alive=$($BIN/pg_ctl -D "$PGDATA" status >/dev/null 2>&1 && echo yes || echo NO)  vm_fork=$(test -f "$PGDATA/${FP}_vm" && echo recreated || echo absent)"

echo "===================== RESULT ====================="
if crashed; then
  echo ">>> REPRODUCED: launcher assert/crash"
  grep -nE "TRAP:|pgstat_tracks_io_op|autovacuum launcher process.*terminated|signal 6" "$LOG" | head
else
  echo ">>> NO CRASH"
fi
echo "=================================================="
$BIN/pg_ctl -D "$PGDATA" -m immediate stop >/dev/null 2>&1

Reply via email to