Hi Sami > Alternatively, this information might be better exposed in a new system > view, showing the "PID", "XID," and "reason" a transaction is blocking VACUUM. > This approach is more proactive as a DBA can continuously monitor for > blocking reasons and take action before it becomes an issue. Yes ,If this approach is acceptable, then as a reference for the standby conflict-handling and query replay mechanism, we could consider introducing a GUC parameter to terminate PID that blocks VACUUM freeze when the relation age is approaching the vacuum_failsafe_age threshold.
Thanks On Tue, Jan 6, 2026 at 6:02 AM Sami Imseih <[email protected]> wrote: > > Thank you all for the review comments, and sorry for the late reply. > > I will address the review comments in order. > > > > On Sat, Nov 15, 2025 at 9:25 AM Sami Imseih <[email protected]> wrote: > > > More importantly: > > > > > > 3/ As mentioned earlier in the thread, the "idle-in-transaction" > > > transactions is not being reported correctly, particularly for write > > > tansactions. I think that is an important missing case. The reason > > > for this is the cutoff xmin is not being looked up against the current > > > list of xid's, so we are not blaming the correct pid. > > > > > > 4/ > > > Thinking about point 3 above, I began to wonder if this > > > whole thing can be simplified with inspiration. Looking at the > > > existing BackendXidGetPid(), I think it can. > > > > > > Based on BackendXidGetPid(), I tried a new routine called > > > BackendXidFindCutOffReason() which can take in the cutoff xmin, > > > passed in by vacuum and can walk though the proc array and > > > determine the reason. We don't need to touch ComputeXidHorizons() > > > to make this work, it seems to me. This comes with an additional > > > walk though the procarray holding a shared lock, but I don't think > > > this will be an issue. > > > > > > Attached is a rough sketch of BackendXidFindCutOffReason() > > > For now, I just added NOTICE messages which will log with > > > VACUUM (verbose) for testing. > > > > Thanks for the revised proposal! Your approach is clear and makes the > > code easier to read. > > My approach is focused on correctness rather than simplicity. > > The current logic in ComputeXidHorizons can report the wrong PID for > blocking transactions. For example: > > 1. start a pgbench with 5 clients for some time ( i.e. 5 minutes ) > ``` > pgbench -i -s50 > pgbench -c5 -T300 > ```` > > 2. start a long running transaction that consumes an XID > ``` > postgres=# begin; > BEGIN > postgres=*# SELECT txid_current(), pg_backend_pid(); > txid_current | pg_backend_pid > --------------+---------------- > 3665231 | 266601 > (1 row) > ``` > > 3. run a vacuum > ``` > postgres=# vacuum verbose pgbench_accounts ; > INFO: vacuuming "postgres.public.pgbench_accounts" > INFO: finished vacuuming "postgres.public.pgbench_accounts": index scans: > 0 > pages: 0 removed, 59833 remain, 59043 scanned (98.68% of total), 0 > eagerly scanned > tuples: 0 removed, 1045578 remain, 35425 are dead but not yet removable > removable cutoff: 4301694, which was 35981 XIDs old when operation ended > oldest xmin source: active transaction (pid=267064) > frozen: 0 pages from table (0.00% of total) had 0 tuples frozen > visibility map: 0 pages set all-visible, 0 pages set all-frozen (0 > were all-visible) > index scan bypassed: 405 pages from table (0.68% of total) have 1494 > dead item identifiers > avg read rate: 1320.425 MB/s, avg write rate: 2.404 MB/s > buffer usage: 67745 hits, 49974 reads, 91 dirtied > WAL usage: 1 records, 0 full page images, 299 bytes, 0 full page image > bytes, 0 buffers full > system usage: CPU: user: 0.22 s, system: 0.06 s, elapsed: 0.29 s > VACUUM > ``` > > VACUUM reports the oldest-XID source PID as 267090 , but the correct > PID is 267064. This happens because the ComputeXidHorizons loop picks > the first PID it encounters with the cutoff XID, even if other backends > have the same XID. There’s no reliable way within that loop to identify > the actual blocking transaction. > > ``` > postgres=# SELECT pid, datname, usename, state, backend_xmin, > backend_xid, substr(query, 1, 20) as query > FROM pg_stat_activity; > pid | datname | usename | state | backend_xmin | > backend_xid | query > > --------+----------+----------+---------------------+--------------+-------------+---------------------- > 267064 | postgres | postgres | active | 4301694 | > 4351291 | UPDATE pgbench_branc > 267069 | postgres | postgres | active | 4301694 | > 4351332 | UPDATE pgbench_telle > 267067 | postgres | postgres | active | 4301694 | > 4351299 | UPDATE pgbench_branc > 267070 | postgres | postgres | active | 4301694 | > 4351279 | UPDATE pgbench_branc > 267068 | postgres | postgres | active | | > 4351325 | UPDATE pgbench_telle > 267066 | postgres | postgres | active | 4301694 | > 4351327 | UPDATE pgbench_branc > 267065 | postgres | postgres | active | 4301694 | > 4351292 | UPDATE pgbench_branc > 267077 | postgres | postgres | active | 4301694 | > 4351303 | UPDATE pgbench_branc > 266606 | postgres | postgres | active | 4301694 | > | SELECT pid, datname, > 267071 | postgres | postgres | active | | > | BEGIN; > 267072 | postgres | postgres | active | 4301694 | > 4351300 | UPDATE pgbench_telle > 267073 | postgres | postgres | active | 4301694 | > 4351258 | UPDATE pgbench_branc > 267075 | postgres | postgres | idle | | > | END; > 267074 | postgres | postgres | active | 4301694 | > 4351319 | UPDATE pgbench_branc > 267076 | postgres | postgres | active | | > 4351248 | END; > 267084 | postgres | postgres | active | 4301694 | > 4351330 | UPDATE pgbench_telle > 267078 | postgres | postgres | active | 4301694 | > 4351260 | UPDATE pgbench_branc > 267082 | postgres | postgres | active | | > 4351309 | END; > 267081 | postgres | postgres | active | | > 4351270 | UPDATE pgbench_branc > 267083 | postgres | postgres | active | | > 4351313 | END; > 267080 | postgres | postgres | active | 4301694 | > 4351311 | UPDATE pgbench_branc > 267079 | postgres | postgres | active | 4301694 | > 4351318 | UPDATE pgbench_branc > 267086 | postgres | postgres | active | 4301694 | > 4351335 | UPDATE pgbench_branc > 267085 | postgres | postgres | active | | > | BEGIN; > 267090 | postgres | postgres | idle in transaction | | > 4301694 | SELECT txid_current( ************ > ``` > > > - Your proposal incurs additional cost. Furthermore, the time lag > > between the execution of ComputeXidHorizons() and > > BackendXidFindCutOffReason() could lead to inaccurate logging. > > While scanning the proc array adds some overhead, it could be limited > to cases where multiple VACUUMs are stuck on the same cutoff XID, but > we will need to track the last cutoff-xmin to make that possible. > > Alternatively, this information might be better exposed in a new system > view, showing the "PID", "XID," and "reason" a transaction is blocking > VACUUM. > This approach is more proactive as a DBA can continuously monitor for > blocking reasons and take action before it becomes an issue. > > -- > Sami Imseih > Amazon Web Services (AWS) >
