On Tue, Nov 25, 2025 at 1:12 AM Maxim Orlov <[email protected]> wrote:
> Here is a rebase @ 4d3f623ea88
>
> The patch set has grown increasingly old. It appears to me that
> everything needs to be thoroughly overhauled to bring it up to date.
> Especially given that the first patch to be committed switches
> multi-offsets to 64-bit and utilizes a different pg_upgrade method.
>
> I am planning to tackle this in the near future and give the patch a
> thorough overhaul.

Here's a bit of high-level review of this patch set, focusing on the README.

In general, I believe that the design choice to store the high bits of
the XID and MXID in the special space of heap page has the support of
community members. It does represent a space-vs-convenience tradeoff,
in the sense that the bytes we use to store this information can no
longer be used for tuple data. However, it's only a very small
fraction of the space on the page, and as transaction rates continue
to climb, it's easier and easier to run into wraparound problems.
Eventually, we have to do something about them, and right now, we have
no competing proposal. That said, this is an incredibly dangerous
patch set. It will affect every single user of PostgreSQL, there's no
way to turn it off if it doesn't work, it could result in
irrecoverable data loss or corruption if it doesn't work properly,
problems may take a long time to surface because of the nature of
wraparound, and fixing any problems in an acceptable way in a minor
release could be incredibly difficult. Therefore, I think it would be
wise for anyone thinking of committing this patch to exercise
*extreme* caution. If we introduce a bunch of data-corrupting bugs
here, it could severely damage the reputation of the project for years
to come.

It's worth considering why this patch set hasn't made more progress up
until this point. It could be simply that the patch set is big and
nobody quite has time to review it thoroughly. However, it's my
observation that when there's a patch set floating around for years
that fixes a problem that other committers know to be important and
yet it doesn't get committed, it's often a sign that some committers
have taken a peek at the patch set and don't believe that it's taking
the right approach, or don't believe the code is viable, or believe
that fixing the code to be viable would require way more work than
they can justify putting into someone else's patch. I've seen cases
where committers have explicitly said this on list and the patch
authors just keep posting new versions anyway -- but I'm sure it also
happens that people don't want to offend the patch author or get into
an argument and just quietly move on to other things. It would be nice
to hear from some other committers or senior hackers whether they've
studied this patch set and whether they believe it to be in a shape
where we could consider proceeding with it.
For myself, I have not studied it.

But I did read the README, so here are some comments about that:

+A limited number (N = 2^32) of XID's required to do vacuum freeze to prevent
+wraparound every N/2 transactions. This causes performance degradation due
+to the need to read and rewrite all not yet frozen pages tables while being
+vacuumed. In each wraparound cycle, SLRU buffers are also being cut.

Here and throughout the README and really the entire patch set, you
need to refer to concepts that exist elsewhere in the system using the
same terminology that we use for those things in other places. We talk
about SLRUs being truncated, not SLRU buffers being cut. It's also
incorrect to refer to having to do vacuum freeze: FREEZE is an option
to VACUUM that causes the VACUUM to be performed with the newest
possible XID and MXID cutoffs. That is not required. What is required
is an aggressive vacuum. Here and throughout this patch set, greater
linguistic precision than this is absolutely required.

Stepping back from language issues, I feel that
relfrozenxid/relminmxid advancement serves two purposes that aren't
clearly distinguished here. One is to permit SLRU truncation, and the
other is to avoid exhausting space in the SLRU. If we change things so
that we can use essentially unlimited space within each SLRU, we still
need to truncate them often enough that the disk usage doesn't become
disproportionate. I think whoever wrote this knows that, but it isn't
very clearly explained here.

+With 64-bit XID's wraparound is effectively postponed to a very distant
+future. Even in highly loaded systems that had 2^32 transactions per day
+it will take huge 2^31 days before the first enforced "vacuum to prevent
+wraparound"). Buffers cutting and routine vacuum are not enforced, and DBA
+can plan them independently at the time with the least system load and least
+critical for database performance. Also, it can be done less frequently
+(several times a year vs every several days) on systems with transaction rates
+similar to those mentioned above.

"Buffers cutting and routine vacuum are not enforced" is not clear at
all and not our standard terminology. Routine vacuuming, and
autovacuum, are still clearly needed, else bloat would rapidly
paralyze the system. I'm guessing this refers to the timing of
aggressive vacuums, but it's unclear whether this is saying that the
patch makes a change to how those work as compared with the current
system, or just that there is more theoretical flexibility.

The logic behind "it can be done less frequently" should be explained.
I assume that what this really means is that aggressive vacuums are
still required to prevent unreasonable disk space utilization by the
clog and multixact SLRUs, but that the author of this statement
believes that it's reasonable to use a lot more disk space for those
SLRUs than we currently do on systems with lots of (M)XID consumption.
(For reference, clog stores two status bits per XID, so keeping 2^32
XIDs of history around occupies about 1GB.) But that could be spelled
out a lot better than it is.

Also, the statement "With 64-bit XID's wraparound is effectively
postponed to a very distant future" makes it sound as if we're still
going to keep the code that handles wraparound, but how could it ever
trigger? The LSN space is only 64 bits wide, and I assume there's no
reasonable way to consume XIDs or MXIDs faster than we consume LSNs.
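(For scale: even burning a million XIDs per second, it would take
roughly 585,000 years to consume 2^64 of them.)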

+At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
+version pd_special area with a size of 16 bytes should be added to a page.
+Though a page may not have space for this. Then it can be converted to a
+temporary format called "double XMAX".

This section generally makes sense to me but fails to explain how we
know that a given tuple is in double XMAX format. Does that use an
infomask bit or what? Consuming an infomask bit might be objectionable
to some hackers, as they are a precious and *extremely* limited
resource.

I am not at all convinced that we should use 16 bytes for this. It
seems to me that it would be a lot simpler to just store the epoch in
the page header (and the equivalent thing for MXIDs). I think actually
using exactly 8 bytes is not appealing, because if we insist that
every tuple on a page has to be from the same epoch, then that means
that when the epoch changes, the next change to every single page in
the system will have to rejigger the whole page, which might suck. But
what I have discussed in the past (and I think on the list) is the
idea of using half-epochs: the page header says something like "epoch
1234, first half" or "epoch 1234, second half". In the first case, all
XIDs observed on the page are in epoch 1234. In the second case, any
XID >= 2^31 is in epoch 1234 and any XID < 2^31 is in epoch 1235. That
way, pages where all tuples are relatively recent will almost never
need updating when we advance into a new half-epoch: most of the time,
we'll just be able to bump to the next half-epoch without changing any
tuples.
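
To pin that down, here's a minimal sketch of the decoding rule I have
in mind. Everything here is invented for illustration; none of these
names exist in the tree:

    /*
     * Hypothetical sketch: reconstruct a 64-bit XID from a 32-bit
     * on-page XID given a half-epoch page label. "second_half" means
     * the page's window starts at the midpoint of page_epoch, so small
     * XIDs have already wrapped into the following epoch.
     */
    static inline uint64
    half_epoch_to_full_xid(uint32 page_epoch, bool second_half,
                           uint32 xid)
    {
        uint64      epoch = page_epoch;

        if (second_half && xid < ((uint32) 1 << 31))
            epoch++;            /* XID belongs to the following epoch */

        return (epoch << 32) | xid;
    }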

(On the other hand, if we're going to dirty the page anyway, how much
does it matter if we have to adjust a bunch of tuple headers at the
same time? So maybe using whole epochs rather than half-epochs is
fine.)

Another way of thinking about this is that we don't really need a
64-bit base XID, because we don't need that much precision. It's fine
to say that the base XID always has to be a multiple of 2^31 or 2^32,
meaning that we only need 32 or 33 bits for it. We can save a number
of bytes in every page this way, and I think the logic will be simpler
as well.
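
As a sketch, assuming we insist on a 2^31 granularity: the page could
carry a half-epoch count (33 bits of count cover the whole 64-bit XID
space) and recover full XIDs with a shift and an add. Invented names
again:

    /*
     * Hypothetical: the base is always a multiple of 2^31, so store a
     * count of half-epochs instead of a full 64-bit base XID.
     */
    static inline uint64
    page_full_xid(uint64 half_epoch_count, uint32 xid)
    {
        return (half_epoch_count << 31) + xid;
    }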

+In-memory tuple representation consists of two parts:
+- HeapTupleHeader from disk page (contains all heap tuple contents, not only
+header)
+- HeapTuple with additional in-memory fields
+
+HeapTuple for each tuple in memory stores 64bit XMIN/XMAX. They are
+precalculated on tuple read from page with (1) and (2).
+
+The filling of XMIN and XMAX in HeapTuple is done in the same way as the other
+fields of HeapTuple struct. It is done in all cases of HeapTuple manipulation.

I don't find this discussion entirely clear about what the patch is
actually doing. It sounds like we're finding a way to store the
high-order bits of the XMIN and XMAX (or multi-XMAX) in every
HeapTuple, but I'm not sure that I'm interpreting this correctly.

If that is what we're doing, I'm concerned that such a wide-reaching
change could be destabilizing and have major code-maintenance
implications. It seems like we could avoid the need for this. I
assumed that our plan would be to continue to restrict the range of
*running* transactions to no more than 2^31 XIDs from oldest to
newest. On disk, we'd allow older XIDs to exist, but any time we move
the page-level base XID, we know that those older tuples are eligible
to be frozen, and we freeze them. That way, nothing in memory needs
any changes from the way it works now.
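
To make that invariant concrete, here's a hypothetical sketch of what
advancing a page's base XID might look like under that scheme.
freeze_tuple() and heap_page_set_base_xid() are invented, and a real
version would have to deal with xmax and multixacts as well:

    /*
     * Hypothetical sketch: when the page-level base XID advances, any
     * tuple whose XID falls below the new base cannot belong to a
     * running transaction (by the 2^31 rule), so freeze it on the
     * spot. Nothing above the page layer then needs to change.
     */
    static void
    heap_page_advance_base_xid(Page page, uint64 old_base,
                               uint64 new_base)
    {
        OffsetNumber off,
                    maxoff = PageGetMaxOffsetNumber(page);

        Assert(new_base > old_base);

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId      itemid = PageGetItemId(page, off);
            HeapTupleHeader tup;

            if (!ItemIdIsNormal(itemid))
                continue;
            tup = (HeapTupleHeader) PageGetItem(page, itemid);

            if (!TransactionIdIsNormal(HeapTupleHeaderGetRawXmin(tup)))
                continue;       /* already frozen, or bootstrap */

            /* xmax would need the same treatment; omitted here */
            if (old_base + HeapTupleHeaderGetRawXmin(tup) < new_base)
                freeze_tuple(tup);              /* invented name */
        }

        heap_page_set_base_xid(page, new_base); /* invented name */
    }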

+avoid very-long-living transactions with an age close to 2^32. So long-living
+transactions often they are most likely defunct.

This is not good English.

+On insert we check if current XID fits a range (3). Otherwise:
+
+- heap_page_prepare_for_xid() will try to increase pd_xid_base for t_xmin will
+not be over MaxShortTransactionId.
+
+- If it is impossible, then it will try to prune and freeze tuples on a page.
+
+Known issue: if pd_xid_base could not be shifted to accommodate a tuple being
+inserted due to a very long-running transaction, we just throw an error. We
+neither try to insert a tuple into another page nor mark the current page as
+full. So, in this (unlikely) case we will get regular insert errors on the next
+tries to insert to the page 'locked' by this very long-running transaction.

So, essentially what this is saying is that the patch set effectively
retains the limitation on how old the oldest running XID can be, but
instead of enforcing that limitation the way we do now, it enforces it
by making inserts fail unpredictably whenever it can't figure out a
workaround. I think that's clearly worse. This makes me feel that my
above suggestion of keeping the 2^31 limitation on the oldest running
XID is the right way to go.

+3. If a page is in double XMAX format after its first read, and vacuum (or
+micro-vacuum at select query) could prune some tuples and free space for

Here, "micro-vacuum" is referring to HOT pruning. This is another example of
how this README is inventing its own terms for things instead of using
standard terms.

+There is a special case when the first read of a tuple is done in read-only
+state (in read-only transaction or on replica). This tuples are to be converted
+"in memory", but not sync "to disk", unless cluster or transaction changed to
+read-write state (e.g. replica is promoted). In order to support this, we mark
+"in memory" pages with converted tuples with bit REGBUF_CONVERTED in buffer
+descriptor. When in read-write state this will trigger full page write xlog
+record.

There's a lot to unpack here. First, REGBUF_CONVERTED sounds like the
name of an XLogRegisterBuffer flag, rather than a buffer descriptor
flag. The latter start with BM_, not REGBUF_.

What I infer from this is that the patch set proposes to delete the
ability of a heap scan to interpret a page in the current format;
therefore, even when we can't actually modify the buffer, we MUST
convert it to the new format in memory instantly. If that's what's
happening, it should be spelled out instead of just being something
you can maybe infer from what you are being told. And we should maybe
be told why we're doing that. I can't really see why we would want to
go that way off-hand. Most of our code is already aware of the
theoretical possibility of a page having special space, and I would
think that a scan of a page with or without special space should
mostly just work. Why not just adjust that code so that if special
space is present, we use that to infer the xid and base_xid, and if
not, we infer them some other way?
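
Roughly this hypothetical shape is what I'm imagining; HeapPageSpecial,
pd_xid_base, and pre_upgrade_base_xid() are all invented names:

    /*
     * Hypothetical sketch: derive the page's 64-bit XID base whether
     * or not the special area has been added yet.
     */
    static uint64
    heap_page_xid_base(Page page)
    {
        if (PageGetSpecialSize(page) >= sizeof(HeapPageSpecial))
        {
            HeapPageSpecial *special =
                (HeapPageSpecial *) PageGetSpecialPointer(page);

            return special->pd_xid_base;
        }

        /*
         * No special area yet: fall back on whatever rule gives us the
         * base for not-yet-converted pages. How we know that is
         * exactly the open question discussed below.
         */
        return pre_upgrade_base_xid();          /* invented name */
    }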

I don't actually quite understand how this all fits together. It seems
like converting to double XMAX format requires inferring a base XID
and base MXID for the page, but how would we do that? It seems like
the correct values would be those in effect at the time of the
pg_upgrade, but how would we know that at the time of reading the
page? And if we do know it, then why not get rid of double XMAX format
altogether? We would then just have the current page format with or
without the special space, with the base (M)XID coming either from the
special space or from whatever mysterious source of information allows
us to do the double XMAX conversion correctly in the present coding.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

