On Tue, Nov 25, 2025 at 1:12 AM Maxim Orlov <[email protected]> wrote:
> Here is a rebase @ 4d3f623ea88
>
> The patch set has grown increasingly old. It appears to me that
> everything requires to be thoroughly overhauled to bring it up to date.
> Especially given that the first patch to be committed switches
> multi-offsets to 64-bit and utilizes a different pg_upgrade method.
>
> I am planning to tackle this in the near future and give the patch a
> thorough overhaul.
Here's a bit of high-level review of this patch set, focusing on the
README.

In general, I believe that the design choice to store the high bits of
the XID and MXID in the special space of the heap page has the support
of community members. It does represent a space-vs-convenience
tradeoff, in the sense that the bytes we use to store this information
can no longer be used for tuple data. However, it's only a very small
fraction of the space on the page, and as transaction rates continue to
climb, it's easier and easier to run into wraparound problems.
Eventually, we have to do something about them, and right now, we have
no competing proposal.

That said, this is an incredibly dangerous patch set. It will affect
every single user of PostgreSQL, there's no way to turn it off if it
doesn't work, it could result in irrecoverable data loss or corruption
if it doesn't work properly, problems may take a long time to surface
because of the nature of wraparound, and fixing any problems in an
acceptable way in a minor release could be incredibly difficult.
Therefore, I think it would be wise for anyone thinking of committing
this patch to exercise *extreme* caution. If we introduce a bunch of
data-corrupting bugs here, it could severely damage the reputation of
the project for years to come.

It's worth considering why this patch set hasn't made more progress up
until this point. It could be simply that the patch set is big and
nobody quite has time to review it thoroughly. However, it's my
observation that when there's a patch set floating around for years
that fixes a problem that other committers know to be important and yet
it doesn't get committed, it's often a sign that some committers have
taken a peek at the patch set and don't believe that it's taking the
right approach, or don't believe the code is viable, or believe that
fixing the code to be viable would require way more work than they can
justify putting into someone else's patch. I've seen cases where
committers have explicitly said this on list and the patch authors just
keep posting new versions anyway -- but I'm sure it also happens that
people don't want to offend the patch author or get into an argument
and just quietly move on to other things.

It would be nice to hear from some other committers or senior hackers
whether they've studied this patch set and whether they believe it to
be in a shape that we could consider proceeding with or not.

For myself, I have not studied it. But I did read the README, so here
are some comments about that:

+A limited number (N = 2^32) of XID's required to do vacuum freeze to prevent
+wraparound every N/2 transactions. This causes performance degradation due
+to the need to read and rewrite all not yet frozen pages tables while being
+vacuumed. In each wraparound cycle, SLRU buffers are also being cut.

Here and throughout the README and really the entire patch set, you
need to refer to concepts that exist elsewhere in the system using the
same terminology that we use for those things in other places. We talk
about SLRUs being truncated, not SLRU buffers being cut. It's also
incorrect to refer to having to do vacuum freeze: FREEZE is an option
to VACUUM that causes the VACUUM to be performed with the newest
possible XID and MXID cutoffs. That is not required. What is required
is an aggressive vacuum. Here and throughout this patch set, greater
linguistic precision than this is absolutely required.
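To make that distinction concrete, here is a rough standalone sketch,
not the real vacuum.c logic and with arbitrary numbers, of the idea
that FREEZE only moves the freeze cutoffs, while an aggressive vacuum
is forced by relfrozenxid age regardless of whether FREEZE was
requested:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/*
 * Simplified stand-ins: the real code uses wraparound-aware XID
 * comparisons. These helpers only show which inputs drive which
 * decision.
 */
static TransactionId
freeze_cutoff(TransactionId oldest_xmin, TransactionId freeze_min_age)
{
    /* VACUUM (FREEZE) behaves as if vacuum_freeze_min_age were 0,
     * giving the newest possible cutoff: oldest_xmin itself. */
    return (oldest_xmin > freeze_min_age) ? oldest_xmin - freeze_min_age : 1;
}

static bool
is_aggressive(TransactionId next_xid, TransactionId relfrozenxid,
              TransactionId freeze_table_age)
{
    /* An aggressive, relfrozenxid-advancing vacuum is forced once
     * relfrozenxid is older than vacuum_freeze_table_age (or
     * autovacuum_freeze_max_age), whether or not FREEZE was given. */
    return next_xid - relfrozenxid > freeze_table_age;
}

int
main(void)
{
    TransactionId next_xid = 200000000;
    TransactionId oldest_xmin = 199999000;
    TransactionId relfrozenxid = 30000000;

    printf("plain VACUUM cutoff:  %u\n",
           (unsigned) freeze_cutoff(oldest_xmin, 50000000));
    printf("VACUUM FREEZE cutoff: %u\n",
           (unsigned) freeze_cutoff(oldest_xmin, 0));
    printf("aggressive vacuum forced: %d\n",
           is_aggressive(next_xid, relfrozenxid, 150000000));
    return 0;
}

(FREEZE does also zero vacuum_freeze_table_age, so a VACUUM (FREEZE) is
always aggressive; the point is that the reverse is not true.)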
Stepping back from language issues, I feel that relfrozenxid/relminmxid
advancement serves two purposes that aren't clearly distinguished here.
One is to permit SLRU truncation, and the other is to avoid exhausting
space in the SLRU. If we change things so that we can use essentially
unlimited space within each SLRU, we still need to truncate them often
enough that the disk usage doesn't become disproportionate. I think
whoever wrote this knows that, but it isn't very clearly explained
here.

+With 64-bit XID's wraparound is effectively postponed to a very distant
+future. Even in highly loaded systems that had 2^32 transactions per day
+it will take huge 2^31 days before the first enforced "vacuum to prevent
+wraparound"). Buffers cutting and routine vacuum are not enforced, and DBA
+can plan them independently at the time with the least system load and least
+critical for database performance. Also, it can be done less frequently
+(several times a year vs every several days) on systems with transaction rates
+similar to those mentioned above.

"Buffers cutting and routine vacuum are not enforced" is not clear at
all and not our standard terminology. Routine vacuuming, and
autovacuum, are still clearly needed, else bloat would rapidly paralyze
the system. I'm guessing this refers to the timing of aggressive
vacuums, but it's unclear whether this is saying that the patch makes a
change to how those work as compared with the current system, or just
that there is more theoretical flexibility. The logic behind "it can be
done less frequently" should be explained. I assume that what this
really means is that aggressive vacuums are still required to prevent
unreasonable disk space utilization by the clog and multixact SLRUs,
but that the author of this statement believes that it's reasonable to
use a lot more disk space for those SLRUs than we currently do on
systems with lots of (M)XID consumption. But that could be spelled out
a lot better than it is.

Also, the statement "With 64-bit XID's wraparound is effectively
postponed to a very distant future" makes it sound as if we're still
going to keep the code that handles wraparound, but how could it ever
trigger? The LSN space is only 64 bits wide, and I assume there's no
reasonable way to consume XIDs or MXIDs faster than we consume LSNs.

+At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
+version pd_special area with a size of 16 bytes should be added to a page.
+Though a page may not have space for this. Then it can be converted to a
+temporary format called "double XMAX".

This section generally makes sense to me but fails to explain how we
know that a given tuple is in double XMAX format. Does that use an
infomask bit or what? Consuming an infomask bit might be objectionable
to some hackers, as they are a precious and *extremely* limited
resource.

I am not at all convinced that we should use 16 bytes for this. It
seems to me that it would be a lot simpler to just store the epoch in
the page header (and the equivalent thing for MXIDs). I think actually
using exactly 8 bytes is not appealing, because if we insist that every
tuple on a page has to be from the same epoch, then that means that
when the epoch changes, the next change to every single page in the
system will have to rejigger the whole page, which might suck. But what
I have discussed in the past (and I think on the list) is the idea of
using half-epochs: the page header says something like "epoch 1234,
first half" or "epoch 1234, second half". In the first case, all XIDs
observed on the page are in epoch 1234. In the second case, any XIDs
>= 2^31 are in epoch 1234 and any < 2^31 are in epoch 1235. That way,
pages where all tuples are relatively recent will almost never need
updating when we advance into a new half-epoch: most of the time, we'll
just be able to bump to the next half-epoch without changing any
tuples. (On the other hand, if we're going to dirty the page anyway,
how much does it matter if we have to adjust a bunch of tuple headers
at the same time? So maybe using whole epochs rather than half-epochs
is fine.)
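To make the half-epoch arithmetic concrete, here is a rough standalone
sketch; PageXidStamp and xid_to_64 are invented names purely for
illustration, not anything that exists in the tree or in the patch:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-page stamp: a full epoch number plus a half flag. */
typedef struct PageXidStamp
{
    uint32_t    epoch;          /* e.g. 1234 */
    bool        second_half;    /* false = "first half", true = "second half" */
} PageXidStamp;

static uint64_t
xid_to_64(PageXidStamp stamp, uint32_t xid)
{
    uint64_t    epoch = stamp.epoch;

    /*
     * "epoch E, first half": every XID on the page is in epoch E.
     * "epoch E, second half": XIDs >= 2^31 are still in epoch E, while
     * XIDs < 2^31 have already wrapped into epoch E + 1.
     */
    if (stamp.second_half && xid < ((uint32_t) 1 << 31))
        epoch += 1;

    return (epoch << 32) | xid;
}

int
main(void)
{
    PageXidStamp first = {1234, false};
    PageXidStamp second = {1234, true};

    /* 3000000000 is above 2^31, 5 is below it. */
    printf("%llu\n", (unsigned long long) xid_to_64(first, 3000000000u));
    printf("%llu\n", (unsigned long long) xid_to_64(second, 3000000000u));
    printf("%llu\n", (unsigned long long) xid_to_64(second, 5u));
    return 0;
}

Bumping a page from "epoch E, first half" to "epoch E, second half" is
then free whenever every XID on the page is >= 2^31, since those values
map to the same 64-bit XIDs under both stamps.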
Another way of thinking about this is that we don't really need a
64-bit base XID, because we don't need that much precision. It's fine
to say that the base XID always has to be a multiple of 2^31 or 2^32,
meaning that we only need 32 or 33 bits for it. We can save a number of
bytes in every page this way, and I think the logic will be simpler as
well.

+In-memory tuple representation consists of two parts:
+- HeapTupleHeader from disk page (contains all heap tuple contents, not only
+header)
+- HeapTuple with additional in-memory fields
+
+HeapTuple for each tuple in memory stores 64bit XMIN/XMAX. They are
+precalculated on tuple read from page with (1) and (2).
+
+The filling of XMIN and XMAX in HeapTuple is done in the same way as the other
+fields of HeapTuple struct. It is done in all cases of HeapTuple manipulation.

I don't find this discussion entirely clear about what the patch is
actually doing. It sounds like we're finding a way to store the
high-order bits of the XMIN and XMAX (or multi-XMAX) in every
HeapTuple, but I'm not sure that I'm interpreting this correctly. If
that is what we're doing, I'm concerned that such a wide-reaching
change could be destabilizing and have major code-maintenance
implications. It seems like we could avoid the need for this. I assumed
that our plan would be to continue to restrict the range of *running*
transactions to no more than 2^31 XIDs from oldest to newest. On disk,
we'd allow older XIDs to exist, but any time we move the page-level
base XID, we know that those older tuples are eligible to be frozen,
and we freeze them. That way, nothing in memory needs any changes from
the way it works now.

+avoid very-long-living transactions with an age close to 2^32. So long-living
+transactions often they are most likely defunct.

This is not good English.

+On insert we check if current XID fits a range (3). Otherwise:
+
+- heap_page_prepare_for_xid() will try to increase pd_xid_base for t_xmin will
+not be over MaxShortTransactionId.
+
+- If it is impossible, then it will try to prune and freeze tuples on a page.
+
+Known issue: if pd_xid_base could not be shifted to accommodate a tuple being
+inserted due to a very long-running transaction, we just throw an error. We
+neither try to insert a tuple into another page nor mark the current page as
+full. So, in this (unlikely) case we will get regular insert errors on the next
+tries to insert to the page 'locked' by this very long-running transaction.

So, essentially what this is saying is that the patch set effectively
retains the limitation on how old the oldest running XID can be, but
instead of enforcing that limitation the way we do now, it enforces it
by making inserts fail unpredictably whenever it can't figure out a
workaround. I think that's clearly worse. This makes me feel that my
above suggestion of keeping the 2^31 limitation on the oldest running
XID is the right way to go.
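To illustrate the constraint as I understand it, here is a rough
standalone sketch; it is not the patch's code, and PageSketch,
page_prepare_for_xid() and the MaxShortTransactionId value are
simplifications invented for this example:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative bound on the 32-bit offsets stored in tuple headers. */
#define MaxShortTransactionId  ((uint64_t) 0x7FFFFFFF)

typedef struct PageSketch
{
    uint64_t    pd_xid_base;        /* 64-bit base for this page */
    uint64_t    oldest_unfrozen;    /* oldest 64-bit XID still on the page */
} PageSketch;

/* Can we raise pd_xid_base far enough that xid64 fits on the page? */
static bool
page_prepare_for_xid(PageSketch *page, uint64_t xid64)
{
    uint64_t    min_base_needed;

    if (xid64 - page->pd_xid_base <= MaxShortTransactionId)
        return true;            /* already fits */

    /*
     * The base cannot move past the oldest unfrozen XID still stored on
     * the page, or that tuple's offset would go negative. A sufficiently
     * old transaction therefore pins oldest_unfrozen and can make the
     * shift impossible; that is the prune/freeze-or-error case above.
     */
    min_base_needed = xid64 - MaxShortTransactionId;
    if (min_base_needed > page->oldest_unfrozen)
        return false;

    page->pd_xid_base = min_base_needed;
    return true;
}

int
main(void)
{
    PageSketch  recent = {0, 2900000000ULL};   /* only recent tuples */
    PageSketch  pinned = {0, 100};             /* one very old tuple */
    bool        ok;

    ok = page_prepare_for_xid(&recent, 3000000000ULL);
    printf("recent page: shift ok = %d, base now %llu\n",
           ok, (unsigned long long) recent.pd_xid_base);

    ok = page_prepare_for_xid(&pinned, 3000000000ULL);
    printf("pinned page: shift ok = %d\n", ok);
    return 0;
}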
+3. If a page is in double XMAX format after its first read, and vacuum (or
+micro-vacuum at select query) could prune some tuples and free space for

"micro-vacuum" is referring to HOT pruning. This is another example of
how this README is inventing its own terms for things instead of using
standard terms.

+There is a special case when the first read of a tuple is done in read-only
+state (in read-only transaction or on replica). This tuples are to be converted
+"in memory", but not sync "to disk", unless cluster or transaction changed to
+read-write state (e.g. replica is promoted). In order to support this, we mark
+"in memory" pages with converted tuples with bit REGBUF_CONVERTED in buffer
+descriptor. When in read-write state this will trigger full page write xlog
+record.

There's a lot to unpack here. First, REGBUF_CONVERTED sounds like the
name of an XLogRegisterBuffer flag, rather than a buffer descriptor
flag. The latter type of flags start with BM_, not REGBUF_.

What I infer from this is that the patch set proposes to delete the
ability of a heap scan to interpret a page in the current format;
therefore, even when we can't actually modify the buffer, we MUST
convert it to the new format in memory instantly. If that's what's
happening, it should be spelled out instead of just being something you
can maybe infer from what you are being told. And we should maybe be
told why we're doing that. I can't really see why we would want to go
that way off-hand. Most of our code is already aware of the theoretical
possibility of a page having special space, and I would think that a
scan of a page with or without special space should mostly just work.
Why not just adjust that code so that if special space is present, we
use that to infer the xid and base_xid, and if not, we infer them some
other way?

I don't actually quite understand how this all fits together. It seems
like converting to double XMAX format requires inferring a base XID and
base MXID for the page, but how would we do that? It seems like the
correct values would be those in effect at the time of the pg_upgrade,
but how would we know that at the time of reading the page? And if we
do know it, then why not get rid of double XMAX format altogether? We
would then just have the current page format with or without the
special space, with the base (M)XID coming either from the special
space or from whatever mysterious source of information allows us to do
the double XMAX conversion correctly in the present coding.

--
Robert Haas
EDB: http://www.enterprisedb.com
