On Wed, Jun 22, 2016 at 8:44 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> > Insertion will happen by scanning the appropriate bucket and needs to
>>> > retain pin on primary bucket to ensure that concurrent split doesn't
>>> > happen,
>>> > otherwise split might leave this tuple unaccounted.
>>> What do you mean by "unaccounted"?
>> It means that the split might leave this tuple in the old bucket even if it
>> could be moved to the new bucket. Consider a case where an insertion has to
>> add a tuple to some intermediate overflow bucket in the bucket chain; if we
>> allow a split while the insertion is in progress, the split might not move
>> this newly inserted tuple to the new bucket.
>>> I think this is basically correct, although I don't find it to be as
>>> clear as I think it could be. It seems very clear that any operation
>>> which potentially changes the order of tuples in the bucket chain,
>>> such as the squeeze phase as currently implemented, also needs to
>>> exclude all concurrent scans. However, I think that it's OK for
>>> vacuum to remove tuples from a given page with only an exclusive lock
>>> on that particular page.
>> How can we guarantee that it doesn't remove a tuple that is required by a
>> scan started after the split-in-progress flag is set?
> If the tuple is being removed by VACUUM, it is dead. We can remove
> dead tuples right away, because no MVCC scan will see them. In fact,
> the only snapshot that will see them is SnapshotAny, and there's no
> problem with removing dead tuples while a SnapshotAny scan is in
> progress. It's no different than heap_page_prune() removing tuples
> that a SnapshotAny sequential scan was about to see.
While again thinking about this case, it seems to me that we need a
cleanup lock even for dead tuple removal. The reason is that scans
that return multiple tuples always restart from the offset number at
which they returned the last tuple. Now, consider the case where the
first tuple is returned from offset number 3 on a page, and after that
another backend removes the corresponding tuple from the heap and
vacuum also removes the dead index tuple at offset 3. When the scan
tries to get the next tuple, it will restart from offset 3, which can
lead to incorrect results: removing the item compacts the line pointer
array, shifting the remaining tuples down, so the scan resumes at the
wrong position and can skip a tuple it should have returned.
Now, one way to solve the above problem could be to change scans for
hash indexes so that they work a page at a time, as we do for btree
scans (refer to BTScanPosData and the comments on top of it). This has
the additional advantage of reducing the lock/unlock calls needed to
retrieve tuples from a page. However, I think this solution can only
work for MVCC scans. For non-MVCC scans, there is still a problem:
after fetching all the tuples from a page, when the scan tries to
check the validity of those tuples in the heap, we won't be able to
detect that the old tuple has been deleted and a new tuple has been
placed at that location in the heap.
I think what we can do to solve this for non-MVCC scans is to make
vacuum always take a cleanup lock on a bucket; MVCC scans will release
both the lock and the pin as they proceed, while non-MVCC scans and
scans that are started during split-in-progress will release the lock
but retain a pin on the primary bucket. This way, we can allow vacuum
to proceed even if there is an MVCC scan going on in a bucket, as long
as that scan was not started during a bucket split operation. We do
something similar in the btree code: vacuum always takes a cleanup
lock on the page and non-MVCC scans retain a pin on it.
Insertions should work as in the current patch, that is, they always
need to retain a pin on the primary bucket to avoid the concurrent
split problem mentioned above (see the first paragraph of this mail).
Sent via pgsql-hackers mailing list (email@example.com)