On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> I think this comment is saying that we'll release the pin on the
>>> primary bucket page for now, and then reacquire it later if the user
>>> reverses the scan direction. But that doesn't sound very safe,
>>> because the bucket could be split in the meantime and the order in
>>> which tuples are returned could change. I think we want that to
>>> remain stable within a single query execution.
>> Isn't that possible even without the patch? Basically, after reaching
>> end of forward scan and for doing backward *all* scan, we need to
>> perform portal rewind which will in turn call hashrescan where we will
>> drop the lock on bucket and then again when we try to move cursor
>> forward we acquire lock in _hash_first(), so in between when we don't
>> have the lock, the split could happen and next scan results could
> Well, the existing code doesn't drop the heavyweight lock at that
> location, but your patch does drop the pin that serves the same
> function, so I feel like there must be some difference.
Yes, but I am not sure if existing code is right. Consider below scenario,
Declare c cursor for select * from t4 where c1=1;
Fetch forward all from c; --here shared heavy-weight lock count becomes 1
Fetch prior from c; --here shared heavy-weight lock count becomes 2
close c; -- here, lock release will reduce the lock count and shared
heavy-weight lock count becomes 1
Now, if we try to insert from another session, such that it leads to
bucket-split of the bucket for which session-1 had used a cursor, it
will wait for session-1. The insert can only proceed after session-1
performs the commit. I think after the cursor is closed in session-1,
the insert from another session should succeed, don't you think so?
>>> Come to think of it, I'm a little worried about the locking in
>>> _hash_squeezebucket(). It seems like we drop the lock on each "write"
>>> bucket page before taking the lock on the next one. So a concurrent
>>> scan could get ahead of the cleanup process. That would be bad,
>>> wouldn't it?
>> Yeah, that would be bad if it happens, but no concurrent scan can
>> happen during squeeze phase because we take an exclusive lock on a
>> bucket page and maintain it throughout the operation.
> Well, that's completely unacceptable. A major reason the current code
> uses heavyweight locks is because you can't hold lightweight locks
> across arbitrary amounts of work -- because, just to take one example,
> a process holding or waiting for an LWLock isn't interruptible. The
> point of this redesign was to get rid of that, so that LWLocks are
> only held for short periods. I dislike the lock-chaining approach
> (take the next lock before releasing the previous one) quite a bit and
> really would like to find a way to get rid of that, but the idea of
> holding a buffer lock across a complete traversal of an unbounded
> number of overflow buckets is far worse. We've got to come up with a
> design that doesn't require that, or else completely redesign the
> bucket-squeezing stuff.
I think we can use the idea of lock-chaining (take the next lock
before releasing the previous one) for squeeze-phase to solve this
issue. Basically for squeeze operation, what we need to take care is
that there shouldn't be any scan before we start the squeeze and then
afterward if the scan starts, it should be always behind write-end of
a squeeze. If we follow this, then there shouldn't be any problem
even for backward scans because to start backward scans, it needs to
start with the first bucket and reach last bucket page by locking each
bucket page in read mode.
> (Would it make any sense to change the order of the hash index patches
> we've got outstanding? For instance, if we did the page-at-a-time
> stuff first, it would make life simpler for this patch in several
> ways, possibly including this issue.)
I agree that page-at-a-time can help hash indexes, but I don't think
it can help with this particular issue of squeeze operation. While
cleaning dead-tuples, it would be okay even if scan went ahead of
cleanup (considering we have page-at-a-time mode), but for squeeze, we
can't afford that because it can move some tuples to a prior bucket
page and scan would miss those tuples. Also, page-at-a-time will help
cleaning tuples only for MVCC scans (it might not help for unlogged
tables scan or non-MVCC scans). Another point is that we don't have a
patch for page-at-a-time scan ready at this stage.
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription: