On 07/02/2026 17:54, Robert Haas wrote:
It's worth considering why this patch set hasn't made more progress up
until this point. It could be simply that the patch set is big and
nobody quite has time to review it thoroughly. However, it's my
observation that when there's a patch set floating around for years
that fixes a problem that other committers know to be important and
yet it doesn't get committed, it's often a sign that some committers
have taken a peek at the patch set and don't believe that it's taking
the right approach, or don't believe the code is viable, or believe
that fixing the code to be viable would require way more work than
they can justify putting into someone else's patch. I've seen cases
where committers have explicitly said this on list and the patch
authors just keep posting new versions anyway -- but I'm sure it also
happens that people don't want to offend the patch author or get into
an argument and just quietly move on to other things. It would be nice
to hear from some other committers or senior hackers whether they've
studied this patch set and whether they believe it is in a shape
that we could consider proceeding with.
For myself, I have not studied it.

FWIW I've looked at this several times in the past. It's a big patch and I haven't had the energy to review it thoroughly enough to actually commit it. As you said, it's scary because of the risk of corruption. But I think this is viable and basically the right approach.

The thing I like least about this is how the upgrade works, i.e. the conversion code and the "double xmax" hack. This would look much nicer if we could start from a clean slate and just add the fields we need to the page header. However, upgrade is important (that point has been discussed a lot on the list), and I don't have any better ideas. I think it's as good as it gets at the high level.

(I'm talking about the high-level design here. There are a lot of small issues here and there and tons of cleanup needed, like all the stuff that you just pointed out.)

+At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
+version pd_special area with a size of 16 bytes should be added to a page.
+Though a page may not have space for this. Then it can be converted to a
+temporary format called "double XMAX".

This section generally makes sense to me but fails to explain how we
know that a given tuple is in double XMAX format. Does that use an
infomask bit or what? Consuming an infomask bit might be objectionable
to some hackers, as they are a precious and *extremely* limited
resource.

I am not at all convinced that we should use 16 bytes for this. It
seems to me that it would be a lot simpler to just store the epoch in
the page header (and the equivalent thing for MXIDs). I think actually
using exactly 8 bytes is not appealing, because if we insist that
every tuple on a page has to be from the same epoch, then that means
that when the epoch changes, the next change to every single page in
the system will have to rejigger the whole page, which might suck.

Moreover, you simply cannot insist that every tuple on the page is from the same epoch. If the page contains a tuple with an xmin from the previous epoch that is not yet visible to all snapshots, and you want to insert a new tuple into it with the new epoch, what do you do? You can't freeze the existing tuple's xmin yet.
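
To put numbers on that (a throwaway standalone sketch, the values are invented, nothing here is from the patch):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    /* Made-up full 64-bit XIDs just on either side of an epoch boundary. */
    uint64_t old_xmin = (6ULL << 32) - 0x100;   /* epoch 5, near its end */
    uint64_t new_xmin = (6ULL << 32) + 0x42;    /* epoch 6, near its start */

    /* They are far less than 2^31 apart, so both can be live at once... */
    assert(new_xmin - old_xmin < (1ULL << 31));

    /*
     * ...yet they belong to different epochs, so a page holding the old
     * tuple (whose xmin may still matter to some snapshot and therefore
     * cannot be frozen yet) must be able to receive the new one too.
     */
    assert((old_xmin >> 32) != (new_xmin >> 32));
    return 0;
}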

But what I have discussed in the past (and I think on the list) is
the idea of using half-epochs: the page header says something like
"epoch 1234, first half" or "epoch 1234, second half". In the first
case, all XIDs observed on the page are in epoch 1234. In the second
case, any XIDs >= 2^31 are in epoch 1234 and any < 2^31 are in epoch
1235. That way, pages where all tuples are relatively recent will
almost never need updating when we advance into a new half-epoch:
most of the time, we'll just be able to bump to the next half-epoch
without changing any tuples.

Hmm, so that's equivalent to storing a 33-bit base XID, instead of a 64-bit base. Yeah, I think that works: it can represent any two XIDs that are less than 2^31 apart.
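
Spelled out as code, just to be sure we mean the same thing (a sketch of that convention, not anything from the patch):

#include <stdbool.h>
#include <stdint.h>

/*
 * Reconstruct a full 64-bit XID from the 32-bit XID stored in a tuple, given
 * the page's half-epoch label: "epoch E, first half" means every XID on the
 * page is in epoch E; "epoch E, second half" means XIDs >= 2^31 are in epoch
 * E and XIDs < 2^31 are in epoch E + 1.
 */
static uint64_t
decode_xid(uint32_t page_epoch, bool second_half, uint32_t xid)
{
    uint64_t epoch = page_epoch;

    if (second_half && xid < (UINT32_C(1) << 31))
        epoch++;                /* wrapped into the next epoch */

    return (epoch << 32) | xid;
}

Advancing a page to the next half-epoch then only requires that all existing XIDs on it still fall within the new 2^32-wide window, which is what makes recently written pages cheap to bump.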

Another way of thinking about this is that we don't really need a
64-bit base XID, because we don't need that much precision. It's fine
to say that the base XID always has to be a multiple of 2^31 or 2^32,
meaning that we only need 32 or 33 bits for it. We can save a number
of bytes in every page this way, and I think the logic will be simpler
as well.

Right. 32 bits is not enough, but 33 is.
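
The same thing in base terms, as another standalone sketch (again nothing from the patch): with the base constrained to a multiple of 2^31, reconstruction is a one-liner, and the final assert is the reason a base with only 2^32 granularity would not be enough.

#include <assert.h>
#include <stdint.h>

/* base must be a multiple of 2^31; xid holds the low 32 bits of the full XID. */
static uint64_t
xid_from_base(uint64_t base, uint32_t xid)
{
    return base + (uint32_t) (xid - (uint32_t) base);
}

int main(void)
{
    uint64_t a = (1235ULL << 32) - 0x100;   /* epoch 1234, near its end */
    uint64_t b = (1235ULL << 32) + 0x42;    /* epoch 1235, near its start */
    uint64_t base = (1234ULL << 32) + (1ULL << 31);  /* "epoch 1234, second half" */

    /* One base with 2^31 granularity covers both nearby XIDs... */
    assert(xid_from_base(base, (uint32_t) a) == a);
    assert(xid_from_base(base, (uint32_t) b) == b);

    /* ...but no multiple of 2^32 does: 1235 * 2^32 misses a, 1234 * 2^32 misses b. */
    assert(xid_from_base(1235ULL << 32, (uint32_t) a) != a);
    return 0;
}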

I assumed that our plan would be to continue to restrict the range
of *running* transactions to no more than 2^31 XIDs from oldest to
newest. On disk, we'd allow older XIDs to exist, but any time we
move the page-level base XID, we know that those older tuples are
eligible to be frozen, and we freeze them. That way, nothing in
memory needs any changes from the way it works now.

That's been my assumption too.
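
To sketch the rebasing consequence of that (toy code only: FrozenTransactionId really is 2, everything else here is invented and much simpler than the real thing):

#include <stdint.h>

#define FrozenTransactionId ((uint32_t) 2)  /* same value PostgreSQL uses */

/* Toy stand-in for a heap tuple's header fields, not the real structure. */
typedef struct
{
    uint32_t xmin;
    uint32_t xmax;
} ToyTuple;

/*
 * Advance the page's base XID.  Any tuple XID that would fall below the new
 * base is, by the invariant above, no longer needed for visibility checks,
 * so it is frozen in the same pass.  (Real freezing sets infomask bits and
 * has to worry about multixacts; overwriting xmin stands in for that here.)
 */
static void
rebase_page(uint64_t old_base, uint64_t new_base, ToyTuple *tuples, int ntuples)
{
    for (int i = 0; i < ntuples; i++)
    {
        uint64_t xmin = old_base + (uint32_t) (tuples[i].xmin - (uint32_t) old_base);

        if (xmin < new_base)
            tuples[i].xmin = FrozenTransactionId;
        /* xmax would be handled analogously, then the header gets new_base. */
    }
}

int main(void)
{
    ToyTuple tuples[] = {{.xmin = 100, .xmax = 0}};

    rebase_page(0, 1ULL << 31, tuples, 1);      /* 100 < new base -> frozen */
    return tuples[0].xmin == FrozenTransactionId ? 0 : 1;
}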

- Heikki