>> It seems that it contradicts the very idea of your patch, so probably we
>> should look for other ways to optimize this use-case.
>> Maybe this restriction can be relaxed for write only tables, that never
>> have to reread the page because of visibility, or something like that.
>> Also we probably can add to IndexScanDescData info about the expected
>> number of tuples, to allow the index to work more optimally
>> and avoid the overhead for other workloads.

> The idea of the patch is exactly to relax this limitation. I forgot to update 
> that README file though. The current implementation of the patch should be 
> correct like this - that's why I added the look-back code on the page if the 
> tuple couldn't be found anymore on the same location on the page. Similarly, 
> it'll look on the page to the right if it detected a page split. These two 
> measures combined should give a correct implementation of the 'it's possible 
> that a scan stops in the middle of a page' relaxation. However, as Peter and 
> Tom pointed out earlier, they feel that the performance advantage this 
> approach gives does not outweigh the extra complexity at this time. I'd be 
> open to other suggestions though.

Although now that I think of it - do you mean the case where the tuple that we 
returned to the caller after _bt_first actually gets deleted (not moved) from 
the page? I guess that can theoretically happen if _bt_first returns a 
non-visible tuple (but not DEAD yet in the index at the time of _bt_first). For 
my understanding, would a situation like the following lead to this (in my 
patch)?
1) Backend 1 does an index scan and returns the first tuple on _bt_first - this 
tuple is actually deleted in the heap already, however it's not marked dead yet 
in the index.
2) Backend 1 does a heap fetch to check actual visibility and determines the 
tuple is actually dead
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and 
_bt_next), backend 2 comes in and manages both to (1) run _bt_killitems on 
the page, marking tuples dead, and (2) compact the items on the page, thereby 
actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next - it first tries to 
locate the tuple where it left off, but cannot find it anymore because it got 
removed completely by backend 2.

If this is indeed possible then it's a bad issue unfortunately, and quite hard 
to reproduce, as a lot of things need to happen concurrently while doing a 
visibility check.

As for your patch, I've had some time to take a look at it. For the two TODOs:

+               /* TODO Is it possible that currPage is not valid anymore? */
+               Assert(BTScanPosIsValid(so->currPos))

This Assert exists already a couple of lines earlier at the start of this 
function.

+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */

The comments in the source mention the following about this:
                 * We note the buffer's block number so that we can release
                 * the pin later. This allows us to re-read the buffer if it
                 * is needed again for hinting.
                 */
                so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
                
As we figured out earlier, so->currPos.buf gets set to invalid when we release 
the pin via the unpin macro. So, if we don't store the currPage number 
somewhere else, we cannot obtain the pin again if we need it during killitems. 
I think that's the reason that currPage is stored.

Other than the two TODOs in the code, I think the comments really help 
clarify what's going on in the code - I'd be happy if this gets added.

-Floris

