[Resending: apparently the previous one to the list was eaten by spam filters or something. Changing SMTP relay again ... ]
Ok, let's step back to discuss this again. Sorry for the length -- this is a description of the problem I'm trying to solve, the issues I found, and how I tried to solve them.

The relminxid Patch
===================

What I'm after is not freezing for read-only media, nor archive, nor read-only tables. What I'm after is removing the requirement that all databases be vacuumed wholly every 2 billion transactions.

Now, why do we need to vacuum whole databases at a time?

The Transaction Id Counter
==========================

We know that the Xid counter is weird; it cycles, for starters, and there are special values at the "start" of the cycle that are lower than all other values (BootstrapXid, FrozenXid). The idea here is to allow the counter to wrap around without old tuples being affected, i.e., they still appear to have been committed in some distant past. So we use the special Xid values to mark special stuff, like tuples created by bootstrap processing (which are always known to be good) or tuples in template databases that are not connectable ("frozen" databases). We also use FrozenXid to mark tuples that are very old, i.e. were committed a long time ago and never deleted. Any such tuple is unaffected by the status of the Xid counter.

It should be clear that we must ensure that after a suitable amount of "time" (measured in advancement of the Xid counter) has passed, we change the old Xids in tuples to the special FrozenXid value. The requirement for whole-database vacuuming is there because we need to ensure that this is done in all the tables in the database.

We keep track of a "minimum Xid", call it minxid. The Xid generator refuses to assign a new Xid if this minxid is too far in the past, because we'd risk causing Xid-wraparound data loss if we did; the Xid comparison semantics would start behaving funny, and some tuples that appeared to be alive not many transactions ago would suddenly appear dead.
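To make the "behaving funny" concrete, here is a minimal sketch of the wraparound-aware comparison I'm describing. The constants and the helper name mirror the real ones but this is an illustration, not the actual backend code:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Special values at the "start" of the cycle, as described above. */
#define BootstrapXid   ((TransactionId) 1)
#define FrozenXid      ((TransactionId) 2)
#define FirstNormalXid ((TransactionId) 3)

#define XidIsNormal(x) ((x) >= FirstNormalXid)

/*
 * Wraparound-aware "precedes": special Xids sort before every normal
 * Xid, and normal Xids compare by signed 32-bit difference, so an Xid
 * reads as "in the past" only while the counter is less than 2^31
 * transactions ahead of it -- hence the ~2-billion-transaction limit.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    if (!XidIsNormal(id1) || !XidIsNormal(id2))
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}
```

Note the last property: once the counter advances more than 2^31 past a stored Xid, the signed difference flips sign and that Xid suddenly looks like it's in the future -- that is the data loss we must prevent by freezing in time.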
Clearly, it's important that before we advance this minxid we ensure that all tables in the database have gone through the process of changing all regular Xids into FrozenXid. Currently the only way to ensure that all tables have gone through this process is to process them in a single VACUUM pass. Skip even one table, and you can forget about advancing the minxid. Even if the skipped table was vacuumed in the transaction just before this one. Even if the table is fully frozen, i.e., all tuples in it are marked with FrozenXid. Even if the table is empty.

Tracking minxid Per Table
=========================

So, my idea is to track this minxid per table. To do this, I added a column to pg_class called relminxid. The minimum of it across a database is used to determine each database's minimum, datminxid. The minimum across all databases is used to advance the global minimum Xid counter.

So, if a table has 3 tuples whose Xmins are 42, 512 and FrozenXid, its relminxid is 42. If we keep track of all these religiously during vacuum, we know exactly what minxid we should apply to this particular table.

It is obvious that vacuuming one table can set the minimum for that table. So when the vacuuming is done, we can recalculate the database minimum; and using the minima of all databases, we can advance the global minimum Xid counter and truncate pg_clog. We can do this on each single-table vacuum -- so, no more need for database-wide vacuuming.

If a table is empty, or all tuples in it are frozen, then we must mark the table with relminxid = RecentXmin. This is because there could be an open transaction that writes a new tuple to the table after the vacuum is finished. A newly created table must also be created with relminxid = RecentXmin. Because of this, we never mark a table with relminxid = FrozenXid.

Template Databases
==================

Up to this point everything is relatively simple. Here is where the strange problems appear. The main issue is template databases.
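Before getting into the template-database problem: the per-table rule above can be sketched as follows. This is an illustrative model (hypothetical function and variable names, the real work happens inside vacuum), showing that a table's recorded minimum is the oldest normal Xmin seen, falling back to RecentXmin when the table is empty or fully frozen:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FrozenXid      ((TransactionId) 2)
#define FirstNormalXid ((TransactionId) 3)
#define XidIsNormal(x) ((x) >= FirstNormalXid)

/* Wraparound-aware comparison, as in the Xid counter discussion. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    if (!XidIsNormal(id1) || !XidIsNormal(id2))
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}

/*
 * What a single-table vacuum would record as relminxid: the oldest
 * normal Xmin in the table, or RecentXmin when the table is empty or
 * fully frozen (so a concurrent writer's Xid can't land below the
 * recorded minimum).  Per the rules above, this never returns
 * FrozenXid.
 */
static TransactionId
compute_relminxid(const TransactionId *xmins, int ntuples,
                  TransactionId recent_xmin)
{
    TransactionId minxid = recent_xmin;

    for (int i = 0; i < ntuples; i++)
    {
        if (!XidIsNormal(xmins[i]))
            continue;           /* FrozenXid tuples don't count */
        if (xid_precedes(xmins[i], minxid))
            minxid = xmins[i];
    }
    return minxid;
}
```

With the example from the text, Xmins {42, 512, FrozenXid} yield a relminxid of 42; taking the minimum of relminxid over a database gives datminxid, and the minimum of those over all databases is what lets us truncate pg_clog.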
Why are template databases special? Because they are never vacuumed. More generally, we assume that every database marked "datallowconn = false" is fully frozen, i.e. all tables in it are frozen. Autovacuum skips them. VACUUM ignores them. The minxid calculations ignore them. They are fully frozen, so they don't matter and they don't harm anybody.

That's fine and dandy until you realize what happens when you freeze a database, let a couple billion transactions pass, and then create a database using that one as a template (or just "reallow connections" to a database). Because all the tables were frozen 2 billion transactions ago, they are marked with an old relminxid, so as soon as you vacuum any table, the minxid computations go to hell, and we have a DoS condition.

So, we have to do something to cope with frozen databases. I see two ways:

1. Remove the special case, i.e., process frozen databases in VACUUM like every other database. This is the easiest, because no extra logic is needed. Just make sure they are vacuumed in time. The only problem is that we'd need to uselessly vacuum tables that we know are frozen, from time to time. But then, those tables are probably small, so what's the problem with that?

2. Mark frozen databases specially somehow. To mark databases frozen, we need a way to mark tables as frozen. How do we do that? As I explain below, this allows some nice optimizations, but it's a very tiny can full of a huge amount of worms.

Marking a Table Frozen
======================

Marking a table frozen is as simple as setting relminxid = FrozenXid for the table. As explained above, this cannot be done in a regular postmaster environment, because a concurrent transaction could be doing nasty stuff to the table. So we can do it only in a standalone backend.

On the other hand, a "frozen" table must be marked with relminxid = a-regular-Xid as soon as a transaction writes some tuples on it.
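That unfreezing rule might be sketched like this -- a minimal model with hypothetical names, assuming the per-table state lives somewhere that transactions cannot roll back:

```c
#include <stdint.h>

typedef uint32_t TransactionId;

#define FrozenXid ((TransactionId) 2)

/* Hypothetical stand-in for non-transactional per-table state;
 * the struct and field names here are mine, for illustration. */
typedef struct RelMinXidState
{
    TransactionId relminxid;
} RelMinXidState;

/*
 * Unfreeze-on-write: before a transaction puts its Xid into a frozen
 * table, the table must stop claiming relminxid = FrozenXid.  The
 * update has to stick even if the writer later aborts, because the
 * Xid is physically in the table either way -- which is why this
 * cannot live in ordinary transactional catalog storage.
 */
static void
unfreeze_on_write(RelMinXidState *state, TransactionId writer_xid)
{
    if (state->relminxid == FrozenXid)
        state->relminxid = writer_xid;  /* non-transactional write */
}
```

A second writer hitting an already-unfrozen table changes nothing: the first writer's Xid is already a valid (and older) minimum.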
Note that this "unfreezing" must take place even if the offending transaction is aborted, because the Xid is written in the table nevertheless, and thus it would be incorrect to lose the unfreezing.

This is how pg_class_nt came into existence -- it would be a place where information about a table could be stored without being subject to the rolling back of the transaction that wrote it. So if you find that a table is frozen, you write an unfreezing into its pg_class_nt tuple, and that's it.

Nice optimization: if we detect that a table is fully frozen, then VACUUM is a no-op (not VACUUM FULL), because by definition there are no tuples to remove.

Another optimization: if we are sure that unfreezing works, we can even mark a table as frozen in a postmaster environment, as long as we take an ExclusiveLock on the table. That way we know that the vacuum is the sole transaction concurrently accessing the table; and if another transaction comes along and writes something after we're finished, it'll correctly unfreeze the table and all is well.

Where are the problems in this approach?

1. Performance. We'll need to keep a cache of pg_class_nt tuples. This cache must be independent of the current relcache, because the relcache is properly transactional while the pg_class_nt cache must not be.

2. The current implementation puts the unfreezing in LockRelation. This is a problem, because any user can cause a LockRelation on any table, even if the user does not have access to that table.

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.