Heikki Linnakangas wrote:
> On 04.12.2010 09:14, jes...@krogh.cc wrote:
> > There has been a lot of discussion about index-only scans and how to make the
> > visibility map crash safe. Then followed by a good discussion about hint
> > bits.
> >
> > What seems to be the main concern is the added wal volume and it makes me
> > wonder if there is a way in-between that looks more like hint bits.
> >
> > How about lazily wal-log the complete visibility map say every X minutes or
> > N amount of tuple updates and make the wal recovery jobs of rechecking
> > visibility of pages touched by the wal stream on recovery.
>
> If you WAL-log the visibility map changes after-the-fact, it doesn't
> solve the race condition we're struggling with: the visibility map
> change might hit the disk before the PD_ALL_VISIBLE to the heap page. If
> you crash, you can end up with a situation where the PD_ALL_VISIBLE flag
> on the heap page is not set, but the bit in the visibility map is. Which
> causes serious issues later on.
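The race Heikki describes can be illustrated with a toy model. This is only a sketch in Python, not PostgreSQL code: the `Storage` class and the function names are made up for illustration, and the only behavior assumed is the one stated in this thread, namely that a later heap update clears the VM bit only when it sees PD_ALL_VISIBLE set on the heap page.

```python
# Toy model of the race: the heap page and the VM page are flushed
# independently, with no WAL ordering between them.
# All names here are hypothetical; this is not PostgreSQL code.
from dataclasses import dataclass


@dataclass
class Storage:
    pd_all_visible: bool = False  # flag in the heap page header
    vm_bit: bool = False          # matching bit in the visibility map


def crash_after_partial_flush(flush_heap: bool, flush_vm: bool) -> Storage:
    """Set both bits in memory, flush only the chosen pages, then crash."""
    memory = Storage(pd_all_visible=True, vm_bit=True)
    disk = Storage()
    if flush_heap:
        disk.pd_all_visible = memory.pd_all_visible
    if flush_vm:
        disk.vm_bit = memory.vm_bit
    return disk  # only the on-disk state survives the crash


def update_tuple(page: Storage) -> Storage:
    """A later update clears the VM bit only if PD_ALL_VISIBLE is set."""
    if page.pd_all_visible:
        page.pd_all_visible = False
        page.vm_bit = False
    return page


def index_only_scan_is_safe(page: Storage, really_all_visible: bool) -> bool:
    """The scan skips the heap whenever the VM bit is set."""
    return really_all_visible or not page.vm_bit


if __name__ == "__main__":
    # VM page reached disk, heap page did not: the stale VM bit survives.
    bad = update_tuple(crash_after_partial_flush(flush_heap=False, flush_vm=True))
    print(index_only_scan_is_safe(bad, really_all_visible=False))   # False

    # Heap page reached disk, VM page did not: only extra heap access.
    ok = update_tuple(crash_after_partial_flush(flush_heap=True, flush_vm=False))
    print(index_only_scan_is_safe(ok, really_all_visible=False))    # True
```

The asymmetry is the point: a VM bit that is wrongly clear only costs an extra heap visit, while a VM bit that is wrongly set returns wrong answers.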
Based on hackers emails and a discussion I had with Heikki while we were
in Germany, I have updated the index-only scans wiki to document a known
solution for making the visibility map crash-safe for use by index-only
scans:

	http://wiki.postgresql.org/wiki/Index-only_scans#Making_the_Visibility_Map_Crash-Safe

	Making the Visibility Map Crash-Safe

	Currently, a heap page that has all-visible tuples is marked by vacuum
	as PD_ALL_VISIBLE and the visibility map (VM) bit is set. This is
	currently unlogged, and a crash could require these to be set again.

	The complexity is that for index-only scans, the VM bit has meaning,
	and cannot be incorrectly set (though it can be incorrectly cleared
	because that would just result in additional heap access). If both
	PD_ALL_VISIBLE and the VM bit were to be set, and a crash resulted in
	the VM bit being written to disk but not the PD_ALL_VISIBLE bit, a
	later heap access that wrote a conditionally-visible row would not
	know to clear the VM bit, causing incorrect results for index-only
	scans.

	The solution is to WAL-log the setting of VM bits. This will cause
	full-page writes for the VM page, but this is much less than
	WAL-logging each heap page because a VM page represents many heap
	pages. This requires that the VM page not be written to disk until
	its VM-set WAL record is fsynced to disk. Also, during crash recovery,
	replaying the VM-set WAL record would cause both the VM bit and the
	heap PD_ALL_VISIBLE bit to be set.

-- 
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
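For readers who want to see the ordering rule from the wiki text in action, here is a minimal sketch in Python. It is a toy model under stated assumptions, not the PostgreSQL implementation: the `ToyCluster` class, the `"VM_SET"` record name, and the use of a list index as a stand-in for an LSN are all hypothetical. Heap-page writes and the WAL records for later heap updates are deliberately left out.

```python
# Toy model of the proposed rule: WAL-log every VM-bit set, never write
# the VM page ahead of its WAL record, and have recovery set both bits.
# All names are hypothetical; this is not PostgreSQL code.
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    wal: list = field(default_factory=list)    # WAL records generated so far
    wal_flushed_upto: int = 0                  # count of records fsynced
    vm_page_lsn: int = 0                       # last WAL record touching the VM page
    mem_vm: set = field(default_factory=set)   # heap pages with the VM bit set
    mem_pd: set = field(default_factory=set)   # heap pages with PD_ALL_VISIBLE set
    disk_vm: set = field(default_factory=set)  # on-disk copy of the VM bits
    disk_pd: set = field(default_factory=set)  # on-disk copy of PD_ALL_VISIBLE

    def set_all_visible(self, heap_page: int) -> None:
        """Log the change first, then set both bits in memory."""
        self.wal.append(("VM_SET", heap_page))
        self.vm_page_lsn = len(self.wal)
        self.mem_pd.add(heap_page)
        self.mem_vm.add(heap_page)

    def fsync_wal(self) -> None:
        self.wal_flushed_upto = len(self.wal)

    def write_vm_page(self) -> bool:
        """Refuse to write the VM page until its WAL record is durable."""
        if self.vm_page_lsn > self.wal_flushed_upto:
            return False
        self.disk_vm = set(self.mem_vm)
        return True

    def crash_and_recover(self) -> None:
        """Lose memory and unflushed WAL, then replay what is durable."""
        self.wal = self.wal[: self.wal_flushed_upto]
        self.mem_vm = set(self.disk_vm)
        self.mem_pd = set(self.disk_pd)
        for kind, heap_page in self.wal:
            if kind == "VM_SET":
                # Replay sets BOTH bits, so they cannot disagree.
                self.mem_vm.add(heap_page)
                self.mem_pd.add(heap_page)
        self.vm_page_lsn = len(self.wal)


if __name__ == "__main__":
    c = ToyCluster()
    c.set_all_visible(7)
    print(c.write_vm_page())   # False: WAL record not yet fsynced
    c.fsync_wal()
    print(c.write_vm_page())   # True
    c.crash_and_recover()
    print(c.mem_vm <= c.mem_pd)  # True: no VM bit without PD_ALL_VISIBLE
```

In this model a VM bit can only reach disk after its WAL record, so any crash that preserves the VM bit also preserves the record that re-sets PD_ALL_VISIBLE during replay.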