I've just realized that the locking considerations in this patch are considerably more complicated than we thought. The discussion so far considered only half of what the deletion interlock needs to accomplish. We have to ensure that indexscans don't lose their place, which is solved in the patch by stopping between pages rather than on a specific item. But we also have to ensure that indexscans don't return pointers to heap tuples that have already been deleted by VACUUM --- at least in the case where the caller is using a non-MVCC snapshot. Else, if the tuple is replaced by a newer tuple before the caller is able to visit it, the caller might mistakenly accept that tuple as the one it wanted.
The problematic scenario looks like this:

1. The indexscan slurps a bunch of heap TIDs into local memory. It keeps a pin, but not a lock, on the index page it got the TIDs from.

2. Someone else splits the index page, which is allowed since the page is only pinned, not locked. Now some of the index items are on a new page that the indexscan does not have pinned.

3. vacuum comes along and removes some of the moved items. Since they are on a page that no one has pinned, even the super-exclusive lock that vacuum takes won't prevent this.

4. vacuum removes the corresponding heap tuples.

5. Someone else installs new, unrelated tuples into the freed heap slots.

6. The indexscan finally visits the heap. It finds tuples that are valid per its snapshot, but they aren't what it thought it was finding (they don't necessarily match the index key).

While we could prevent this by rechecking whether fetched heap tuples match the index search condition, that costs an awful lot of cycles for such an obviously rare problem. We need a locking-based solution instead. I believe an appropriate fix looks like this:

1. The indexscan is required to keep its pin on the page it read the items from (even though we realize they may not still be on that page). The pin can only be dropped once we are no longer going to return any more items read from that page.

2. btbulkdelete is required to get super-exclusive lock on *every leaf page in the index*. It must not try to optimize this down to just the pages containing deletable items.

With these rules, even though vacuum might have physically deleted the index item that the indexscan is reporting to its caller, the btbulkdelete call will not have been able to complete. (If bulkdelete arrived at the original page before the indexscan did, then of course it'd already have deleted the item and there's no problem. If it arrives after, then it'll be blocked by being unable to get the super-exclusive lock.)
Hence vacuum will not have removed the heap tuple yet, even though the index item might be gone.

We could improve concurrency if we extended the API so that btgettuple knows whether it is fetching for an MVCC-safe snapshot or not: when the scan is using an MVCC-safe snapshot, it's OK to release the pin immediately. However, I'm not sure this is worth the trouble --- it probably makes sense to hold onto the pin anyway, in case we're told to mark some of the tuples killed.

The patch as given gets point 2 right, but I think it doesn't consistently do point 1, and in any case it's certainly not documenting the issue properly.

BTW, I just realized another bug in the patch: btbulkdelete fails to guarantee that it visits every page in the index. It was OK for btvacuumcleanup to ignore pages added to the index after it starts, but btbulkdelete has to deal with such pages.

One thing I'm kind of wondering is whether amgetmulti should still exist as a separate API. Seems like if amgettuple is doing the equivalent thing under the hood, then amgetmulti isn't saving us anything except a few trips through FunctionCallN. It's not at all clear that it's worth the API complexity of having two different fetch functions.

			regards, tom lane