So, what do we need in order to find our way to index-only scans? 1. The visibility map needs to be crash-safe. The basic idea of index-only scans is that, instead of checking the heap to find out whether each tuple is visible, we first check the visibility map. If the visibility map bit is set, then we know all tuples on the page are visible to all transactions, and therefore the tuple of interest is visible to our transaction. Assuming that a significant number of visibility map bits are set, this should enable us to avoid a fair amount of I/O, especially on large tables, because the visibility map is roughly 8000 times smaller than the heap, and therefore far more practical to keep in cache. However, before we can rely on the visibility map for this purpose, we need to fix the problems that can leave bits set inappropriately in the face of an inconveniently-timed crash. I've been working on a patch for this on and off for a few months now; my latest version is in need of review.
2. Crash safe visibility map vs. pg_upgrade. Even if we make the visibility map crash-safe in 9.2, people are going to want to use pg_upgrade to migrate from older versions, bringing their possibly-not-quite-correct visibility map forks along with them. How should we handle that? We could (2A) arrange to have pg_upgrade nuke all visibility forks when upgrading from a release where the visibility map is not crash-safe to one where it is; (2B) keep a piece of state somewhere indicating, for each relation, whether or not the visibility map can be trusted, set it to false only if pg_upgrade brings the relation over from and older version, and set it to true after a successful vacuum that skips no intervening pages; or (2C) advise the user to do a VACUUM FULL on each of their tables pre-upgrade, and if they don't, treat wrong answers as their own fault. (I doubt anyone will advocate for this option, but for the sake of completeness...). (2A) seems like the simplest solution, especially because it also avoids the overhead of checking the "is the visibility map bit reliable?" flag every time we want to plan a query. 3. Statistics. I believe that in order to accurately estimate the cost of an index-only scan, we're going to need to know the fraction of tuples that are on pages whose visibility map bits are set. I believe it should be fairly straightforward to have ANALYZE collect this information; and I'm inclined to do that as a separate patch. It seems like it would also be nice to know what fraction of tuples are on pages that don't have the visibility map set but where, in fact, all tuples on the page are visible to all transactions, so it would be legal to set the bit. A large discrepancy between these two percentages might be a good reason to auto-vacuum the table (perhaps using a "really lazy vacuum"). I'm not sure if this can be added cheaply, though, and in any case, any change to the set of criteria that will trigger an auto-vacuum is probably a can of worms. Thoughts? 4. Planner and executor changes. In contrast to Heikki's original implementation, I'm inclined to not to try to split the Index Scan node into index scan and heap fetch. Since there are many choices for where to put the heap fetch node (any level of the join tree between the index scan and the root), this seems likely to result in a combinatorial explosion of paths, and I'm not real sure that the payback will be adequate. Furthermore, the idea of allowing user code to see tuples that will only later be determined not to have been visible to that MVCC snapshot in the first place seems pretty scary from a security perspective, though certainly there are possible benefits. Instead, I'm inclined to just have the planner evaluate whether the necessary columns can be extracted purely from the index. If not, we proceed as now. If so, we can use the "index only" approach of using the visibility map to decide which heap fetches can be skipped. It's not clear to me whether we need to compare the cost of the standard approach with the cost of the "index only" approach: in theory, if there aren't any visibility map bits anyway, the "index only" approach could be slower. But I'm not sure whether that problem is significant or common enough to be worth expending a lot of code on. Either way, the number of actual paths doesn't need to increase, because in this design, even if we apply a costing model, one approach will dominate the other. Heikki also suggested considering index scans in cases where we don't now[4, again] but I'm inclined to leave this, too, for a later optimization, again because balancing the increase in paths against the possible performance benefits of using indexes in more situations seems finicky. In short, for a first cut at this, I just want to look at this as a way to get cheaper index scans, and leave everything else to future work. Any thoughts welcome. Incidentally, if anyone else feels like working on this, feel free to let me know and I'm happy to step away, from all of it or from whatever part someone else wants to tackle. I'm mostly working on this because it's something that I think we really need to get done, more than having a burning desire to be the one who does it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company  http://archives.postgresql.org/pgsql-hackers/2011-05/msg00292.php  http://archives.postgresql.org/pgsql-hackers/2011-03/msg00946.php  http://archives.postgresql.org/pgsql-hackers/2009-09/msg01379.php  http://archives.postgresql.org/pgsql-hackers/2009-07/msg00675.php -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers