On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
>> On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> >> That strikes me as a flaw in the implementation rather than the idea.
>> >> You're presupposing a patch where the necessary information is
>> >> available in WAL yet you don't make use of it at the proper time.
>> >
>> > The problem is that the mapping would be somewhere *ahead* from the
>> > transaction/WAL we're currently decoding. We'd need to read ahead till
>> > we find the correct one.
>>
>> Yes, I think that's what you need to do.
>
> My problem with that is that the rewrite can be gigabytes into the future.
>
> When reading forward we could either just continue reading data into the
> reorderbuffer, but delay replaying all future commits till we find the
> currently needed remap. That might have quite the additional
> storage/memory cost, but runtime complexity should be the same as normal
> decoding.
> Or we could individually read ahead for every transaction. But doing so
> for every transaction will get rather expensive (roughly O(amount_of_wal^2)).
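[ Editor's note: the first strategy quoted above -- read forward in a single pass, stash each rewrite map as it is encountered, and delay replaying a commit until the map it depends on has been seen -- can be sketched roughly as follows. This is illustrative only, not PostgreSQL code; all names here (RemapRecord, CommitRecord, read_ahead_decode) are invented for the example. ]

```python
from collections import namedtuple

# Invented record types standing in for decoded WAL records.
RemapRecord = namedtuple("RemapRecord", "old_relfilenode new_relfilenode ctid_map")
CommitRecord = namedtuple("CommitRecord", "xid needs_remap_of")

def read_ahead_decode(wal_records):
    """Single forward pass over WAL: cache every ctid remap we encounter,
    and buffer any commit whose remap lies further ahead in the stream.
    Each WAL record is read exactly once, avoiding the O(wal^2) cost of
    a separate read-ahead per transaction."""
    remap_cache = {}   # old relfilenode -> ctid translation map
    delayed = []       # commits waiting for a remap further ahead in WAL
    replayed = []      # xids, in the order they could actually be replayed
    for rec in wal_records:
        if isinstance(rec, RemapRecord):
            remap_cache[rec.old_relfilenode] = rec.ctid_map
            # A newly available remap may unblock previously delayed commits.
            still_delayed = []
            for c in delayed:
                if c.needs_remap_of in remap_cache:
                    replayed.append(c.xid)
                else:
                    still_delayed.append(c)
            delayed = still_delayed
        elif isinstance(rec, CommitRecord):
            if rec.needs_remap_of is None or rec.needs_remap_of in remap_cache:
                replayed.append(rec.xid)
            else:
                delayed.append(rec)  # remap lies ahead; keep buffering
    return replayed, remap_cache
```

The storage/memory cost Andres mentions shows up here as the `delayed` list, which can grow arbitrarily large before the needed remap arrives.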
[ Sorry it's taken me a bit of time to get back to this; other tasks
intervened, and I also just needed some time to let it settle in my
brain. ]

If you read ahead looking for a set of ctid translations from
relfilenode A to relfilenode B, and along the way you happen to
encounter a set of translations from relfilenode C to relfilenode D,
you could stash that set of translations away somewhere, so that if
the next transaction you process needs that set of mappings, it's
already computed. With that approach, you'd never have to pre-read
the same set of WAL files more than once.

But, as I think about it more, that's not very different from your
idea of stashing the translations someplace other than WAL in the
first place. I mean, if the read-ahead thread generates a series of
files in pg_somethingorother that contain those maps, you could have
just written the maps to that directory in the first place. So on
further review I think we could adopt that approach.

However, I'm leery about the idea of using a relation fork for this.
I'm not sure whether that's what you had in mind, but it gives me the
willies. First, it adds distributed overhead to the system, as
previously discussed; and second, I think the accounting may be kind
of tricky, especially in the face of multiple rewrites. I'd be more
inclined to find a separate place to store the mappings. Note that,
AFAICS, there's no real need for the mapping file to be
block-structured, and I believe they'll be written first (with no
readers) and subsequently only read (with no further writes) and
eventually deleted.

One possible objection to this is that it would preclude decoding on
a standby, which seems like a likely enough thing to want to do. So
maybe it's best to WAL-log the changes to the mapping file so that
the standby can reconstruct it if needed.

> I think that'd be pretty similar to just disallowing VACUUM
> FREEZE/CLUSTER on catalog relations since effectively it'd be too
> expensive to use.
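[ Editor's note: the write-once/read-only lifecycle described above means a mapping file can be a flat sequence of fixed-width records rather than anything block-structured. A minimal sketch, with an invented file name and record layout (nothing here reflects an actual PostgreSQL on-disk format): ]

```python
import os
import struct
import tempfile

# One fixed-width record per tuple moved by the rewrite:
# (old block number, old offset, new block number, new offset).
RECORD = struct.Struct("=IHIH")

def write_map(path, pairs):
    """Append-only write phase: no readers exist yet, so records can
    simply be streamed out in order. pairs is an iterable of
    ((old_blk, old_off), (new_blk, new_off)) ctid pairs."""
    with open(path, "wb") as f:
        for (ob, oo), (nb, no) in pairs:
            f.write(RECORD.pack(ob, oo, nb, no))

def read_map(path):
    """Read-only phase: slurp the whole file back into a dict mapping
    old ctid -> new ctid. No random access is ever needed."""
    mapping = {}
    with open(path, "rb") as f:
        data = f.read()
    for ob, oo, nb, no in RECORD.iter_unpack(data):
        mapping[(ob, oo)] = (nb, no)
    return mapping
```

To make this work on a standby, the same records would additionally have to be WAL-logged so the file could be reconstructed during recovery; naming the file after the (old, new) relfilenode pair would also let a decoder cache one parsed map per rewrite.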
This seems unduly pessimistic to me; unless the catalogs are really
darn big, this is a mostly theoretical problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers