On Wed, May 23, 2012 at 2:05 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> In any case, I would've expected we'd end up changing all these page
> formats if we go over to your concept of a metapage for everything.
> Fixing indexes that already have some kind of metapage will surely be
> trivial compared to fixing heaps.
And on that note... let's talk (more) about how we do that. As far as I can see, there are basically three possible ways we could add or upgrade metapages: (1) put the logic in the old server, so that you convert everything to the new format first and then upgrade; (2) put the logic in pg_upgrade, so that the format conversion happens while pg_upgrade is running; or (3) put the logic in the new server, so that you upgrade first and then convert. Various hybrid options are also possible.

I am inclined to reject (1) out of hand, because it would require back-patching the format conversion code to every server version from which we wish to support upgrades, which doesn't sound prudent, and it would require people to be running a sufficiently new minor release of those server versions in order to upgrade, which is inconvenient.

(2) is appealing by virtue of avoiding the need to simultaneously understand multiple page formats. The major disadvantages are (a) we can't do anything very complicated without a running server, and (b) the database is probably toast if we crash midway through the operation, since pg_upgrade can't write WAL. The latter problem likely isn't really an issue in copy mode, but in link mode it seems pretty serious: a badly timed crash can eat all of your data. We could get around this, maybe, by having pg_upgrade copy files rather than link them if a metapage must be added (but that'll be slower), or we could invent some kind of WAL specifically for pg_upgrade: write a record of the pre- and post-images of all the blocks we're changing somewhere and fsync it before beginning to really make the changes; if the server crashes midway through or the writes fail, you can run pg_upgrade --undo to try to revert the changes, or something like that.

(3) lets you make use of the server infrastructure (WAL, error recovery, etc.) and allows more complex changes to be made, but adds complexity to the server.
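To make the "some kind of WAL specifically for pg_upgrade" idea concrete, here's a minimal, purely hypothetical in-memory sketch; none of this is real pg_upgrade code, and the names (undo_log_block, install_metapage, undo_all) are invented for illustration. The real thing would write the pre-images to a file and fsync it before touching any blocks, but the protocol shape is the same: capture pre-images durably first, then modify, and let --undo replay the saved images if something goes wrong.

```c
/* Hypothetical sketch of a "poor man's WAL" for pg_upgrade:
 * save the pre-image of every block we intend to modify, then
 * make the change; undo_all() plays the pre-images back. */
#include <assert.h>
#include <string.h>

#define BLCKSZ    8192
#define MAXBLOCKS 16

typedef struct
{
    int  blkno;
    char image[BLCKSZ];
} UndoRecord;

static UndoRecord undo_log[MAXBLOCKS];
static int        undo_count = 0;

/* Step 1: capture the pre-image before touching a block.
 * In the real design this would be written to a file and fsync'd. */
static void
undo_log_block(const char *relation, int blkno)
{
    assert(undo_count < MAXBLOCKS);
    undo_log[undo_count].blkno = blkno;
    memcpy(undo_log[undo_count].image, relation + blkno * BLCKSZ, BLCKSZ);
    undo_count++;
}

/* Step 2: make the change (here: stamp a fake metapage into block 0). */
static void
install_metapage(char *relation)
{
    undo_log_block(relation, 0);
    memset(relation, 0, BLCKSZ);
    memcpy(relation, "METAPAGE", 8);
}

/* pg_upgrade --undo: restore every saved pre-image, newest first. */
static void
undo_all(char *relation)
{
    for (int i = undo_count - 1; i >= 0; i--)
        memcpy(relation + undo_log[i].blkno * BLCKSZ,
               undo_log[i].image, BLCKSZ);
    undo_count = 0;
}
```

The weak point, as noted above, is that without a server there's no one to run the undo pass automatically after a crash; the user has to know to invoke it.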
It seems pretty clear that for the index types that already have metapages (btree, gin, hash, spgist), things are not too bad. We still have to decide exactly where the first block of the relation gets rewritten, but that single-block change is all we need, so it's relatively isolated. So the hard cases are gist and heaps. I asked Heikki for suggestions on those cases and he proposed the following ideas:

- For GIST, Heikki proposed that we relocate the root page (which is currently at page 0) to a new page at the end of the relation (or maybe to a free page), overwrite the old root page with a metapage, and put a pointer to the root into the metapage; this would mimic what we already do for btree. I think this is a relatively clean solution, and it's clearly simple enough to be done inside pg_upgrade if we want to avoid having the server know about it.

- For the heap, Heikki proposed that we do, well, basically the same thing: relocate block 0 to the end of the relation or to a free block, store the block to which it was moved in the metapage, and teach the index-scan machinery that whenever it sees a TID with block 0 it should replace the block number with the block number to which the original block 0 was relocated. That's kind of a hack, but Heikki likes it because it potentially lets most of the work happen inside pg_upgrade. (We could make ALL the work happen inside pg_upgrade by letting it scan the indexes and fix up the TIDs, but that's a lot more work that has to be done while the server is off-line.) The only real downside of this method is that the check for block 0 has to be done every time we fetch a TID from the index, which is a little annoying; but then again, if page 0 is going to be reserved for a metapage, we might want to guard against finding that value in an index TID anyway.

- Alternatively, we could handle the heap case by taking an AccessExclusiveLock and reinserting all the tuples in block 0.
This clearly isn't something pg_upgrade would do, so presumably the new server would have to do it. One argument in favor of this approach is that a number of the things for which we might really want a metapage (like the state we need to track to make an unlogged table logged or the other way around) aren't going to be needed until somebody runs a DDL command on the relation anyway. So even if a lot of time goes by before the metapage gets added, maybe it doesn't really matter; and when it does get added, it'll be at a time when we had to lock the relation anyway for other reasons. However, this argument kind of goes out the window if you want to use the metapage for stuff like relfrozenxid or the initial segment of the visibility map, which I do. So I'm inclined to think this is a dead end unless someone has an idea for how to rescue it.

- A third possibility is that, when pg_upgrade is used, we could arrange to write the metadata for non-system-catalog tables into a separate relation fork. Any new relations created after the upgrade would store the metadata in block 0 of the main fork, and running CLUSTER, VACUUM FULL, or a rewriting ALTER TABLE command on the relation would incorporate the metapage into the main fork as well. So the extra metadata fork would only exist for relations brought over via pg_upgrade that had not subsequently been subjected to a rewriting operation. Under this design, the metapage always exists, but in some cases it may be located in a different relation fork. The advantage of this idea is just that index scans need not have any kind of special case for block 0; but on the flip side, everyone who uses the metapage for anything needs to be able to find it in one of two possible locations. That might not be too bad, though: there shouldn't be that many people who need direct access to the metapage information; mostly, we should probably be keeping a copy in each backend's relcache.

Thoughts, preferences, other ideas?
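For what it's worth, Heikki's block-0 remapping check for the heap is cheap in code terms. Here's a hypothetical sketch (the type and function names are invented, not the real ItemPointerData machinery): every TID fetched from an index would pass through something like this before the heap block is read, with the relocation target coming from the metapage (or the relcache copy of it).

```c
/* Hypothetical sketch of the TID remapping check: an index TID
 * pointing at block 0 (now the metapage) is redirected to the
 * block where the old block 0 was relocated. */
#include <assert.h>

typedef unsigned int BlockNumber;

typedef struct
{
    BlockNumber    ip_blkid;  /* block number carried in the index TID */
    unsigned short ip_posid;  /* line pointer offset within the block */
} TidSketch;

/* "moved_block0" stands for the relocation target the metapage
 * would record for the old block 0. */
static BlockNumber
resolve_heap_block(const TidSketch *tid, BlockNumber moved_block0)
{
    if (tid->ip_blkid == 0)
        return moved_block0;   /* old block 0 relocated; follow the pointer */
    return tid->ip_blkid;      /* ordinary TID, use as-is */
}
```

The per-fetch cost is one branch, which is the "little annoying" part noted above; and as also noted, a check that block 0 never appears in an index TID might be worth having anyway once page 0 is reserved.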
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers