On Mon, Sep 23, 2013 at 12:49 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> Right, but what about XactLockTableWait() itself? It only acquires a
>> ShareLock on the xid of the got-there-first inserter that potentially
>> hasn't yet committed/aborted.
>
> That's an interesting point.  As you pointed out in later emails, that
> case is handled for heap tuple locks, but btree uniqueness conflicts
> are a different kettle of fish.

Right.
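
To make that concrete, here's roughly what XactLockTableWait() does
(paraphrased from src/backend/storage/lmgr/lmgr.c; the comments are
mine):

#include "postgres.h"

#include "access/subtrans.h"
#include "access/transam.h"
#include "storage/lmgr.h"
#include "storage/lock.h"
#include "storage/procarray.h"

/*
 * Wait for the transaction "xid" to commit or abort.  This works
 * because every transaction holds an ExclusiveLock on its own xid for
 * its entire lifetime, so our ShareLock request queues up behind it
 * until the transaction ends.
 */
void
XactLockTableWait(TransactionId xid)
{
	LOCKTAG		tag;

	for (;;)
	{
		Assert(TransactionIdIsValid(xid));

		SET_LOCKTAG_TRANSACTION(tag, xid);

		/* Blocks here until xid commits or aborts... */
		(void) LockAcquire(&tag, ShareLock, false, false);

		/* ...and that's all we learn - the lock is dropped at once */
		LockRelease(&tag, ShareLock, false);

		/*
		 * If xid is somehow still in progress, it was a
		 * subtransaction that committed; wait on its parent next.
		 */
		if (!TransactionIdIsInProgress(xid))
			break;
		xid = SubTransGetParent(xid);
	}
}

Note that when the awaited transaction ends, all of the ShareLock
waiters wake up more or less at once (ShareLock is self-compatible),
and nothing dictates the order in which they then proceed - that's the
"race" I come back to below.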

It suits my purposes to have the value locks be held for only an
instant, because:

1) It will perform much better and deadlock much less in certain
scenarios if sessions are given leeway not to block each other across
multiple values in multiple unique indexes (i.e. we make them
"considerate", like a person with a huge shopping cart who lets
another person with one item ahead of them in the queue, and perhaps
in doing so greatly reduces their own wait, because the person with
one item makes the cashier immediately say 'no' to the person with all
those groceries. Ahem). I don't think that this implies any additional
anomalies at read committed, and I'm reasonably confident that it
doesn't regress things to any degree lock-starvation-wise: lock
starvation can only come from a stream of inserters of the same value
that consistently abort, just like the present situation with one
unique index. (If anything it's better with multiple unique indexes
than with only one - there are more opportunities for the
would-be-starved session to hear a definitive 'no' answer and give
up.)

2) It will probably be considerably easier, if and when we improve on
the buffer locking stuff (by adding a value locking SLRU), to assume
that value locks are only briefly held. For example, maybe it's okay
that the implementation doesn't allow page splits on value-locked
pages, and maybe that makes things much easier to reason about. If
you're determined to have a strict serial ordering of value locking
*without serialization failures*, I think what I've already said about
the interactions between row locking and value locking demonstrates
that that's close to impossible, if not actually impossible. Plus, it
would be terrible for performance if that SLRU had to actually swap
value locks to and from disk, which becomes a real possibility if
they're held for a long time (mere index scans aren't going to keep
the cache warm, so the worst-case latency for an innocent inserter
into some narrow range of values might be really bad).

Speaking of ease of implementation, how do you guarantee that the
value lock waiters get the right to insert in serial order (if that's
something that you value, which I don't at read committed)? You have
to fix the same "race" that already exists between blocking on value
lock acquisition and acquiring a ShareLock on an xid. The only remedy
I can see for that is to integrate heap and btree locking in a much
more intimate and therefore sketchy way. You need something like
LockTuple() to arbitrate ordering, but what, and how, and where, and
with how many buffer locks held?
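
For reference, the existing heap-side arbitration works only because
there is a tuple to hang a heavyweight lock off: waiters queue on the
tuple lock *before* waiting on the xid, so they are serviced in
arrival order. A rough sketch of that pattern follows (a hypothetical
helper modeled on what heap_lock_tuple() does, not code from anywhere
in particular):

#include "postgres.h"

#include "storage/itemptr.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical helper illustrating the heap arbitration pattern: by
 * taking the heavyweight tuple lock before waiting on the xid,
 * concurrent waiters queue in arrival order, rather than all racing
 * when the blocking transaction commits or aborts.
 */
static void
wait_for_tuple_in_order(Relation relation, ItemPointer tid,
						TransactionId xwait)
{
	/* Queue behind any earlier waiters on this tuple */
	LockTuple(relation, tid, ExclusiveLock);

	/* Wait for the transaction that "owns" the tuple to end */
	XactLockTableWait(xwait);

	/*
	 * The caller would now re-check the tuple and try to lock or
	 * update it, dropping the arbitration lock when done.
	 */
	UnlockTuple(relation, tid, ExclusiveLock);
}

The difficulty is that there is no equivalent object to lock for a
value that hasn't been inserted anywhere yet.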

Most importantly:

3) As I've already mentioned, heavy value locks (like promise tuples
or similar schemes, as opposed to actual heavyweight locks)
concomitantly increase the window in which a conflict can be created
for row locking. Most transactions last but an instant, so the fact
that other sessions may already be blocked trying to lock the
would-be-duplicate row perhaps isn't that relevant. But doing all that
clean-up is going to give other sessions increased opportunity to lock
the row themselves, and ruin our day.

But these points are about long-held value locks, not the cost of
making their acquisition relatively expensive or inexpensive (but
still more or less instantaneous), so why mention them at all? Well,
since we're blocking everyone else with our value locks, they get to
have a bad day too. All the while, they're perhaps virtually
predestined to find some row to lock, but the window for something to
happen to that row that conflicts with their eventual row locking
(that *unnecessarily* conflicts, as for example when an innocuous HOT
update occurs) gets larger and larger as they wait longer and longer
on value locks. Loosely speaking, things get multiplicatively worse -
total gridlock is probably possible, with the deadlock detector only
breaking the gridlock up a bit if we get lucky (unless, maybe, value
locks last until transaction end... which I think is nigh on
impossible anyway).

The bottom line is that long lasting value locks - value locks that
last the duration of a transaction and are acquired serially, while
guaranteeing that the inserter that gets all the value locks needed
itself gets to insert - have the potential to cascade horribly, in
ways that I can only really begin to reason about. That is, they do
*if* we presume that they have the interactions with row locking that
I believe they do, a belief that no one has taken issue with yet.

Even *considering* this is largely academic, though, because without
some kind of miracle guaranteeing serial ordering - a miracle that
doesn't allow for serialization failures and also doesn't seriously
slow down, for example, updates (by making them care about value locks
*before* they do anything, or in the case of HOT updates *at all*) -
all of this is _impossible_. So, I say let's just do the
actually-serial-ordering for value lock acquisition with serialization
failures at isolation levels above read committed. I've seriously
considered what it would take to do it any other way, so that things
would work how you and Andres expect for read committed, and it makes
my head hurt: apart from seeming unnecessary to me, it also seems
completely hopeless.
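
To be explicit about what I mean by a serialization failure here:
just the standard thing - at REPEATABLE READ or above, the inserter
that loses the race errors out rather than waiting indefinitely or
jumping the queue. Something like (a sketch, not code from my patch):

	ereport(ERROR,
			(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
			 errmsg("could not serialize access due to concurrent update")));

which is an error that clients at those isolation levels must already
be prepared to retry.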

Am I being too negative here? Well, I guess that's possible. The fact
is that it's hard to judge, because all of this is so hard to reason
about. That's what I really don't like about it.

>> I respectfully
>> suggest that that exact definition of upsert isn't a useful one,
>> because other snapshot isolation/MVCC systems operating within the
>> same constraints must have the same issues, and yet they manage to
>> implement something that could be called upsert that people seem happy
>> with.
>
> Yeah.  I wonder how they do that.

My guess is that they have some fancy snapshot type that is used by
the equivalent of ModifyTable subplans, one that is appropriately
paranoid about the Halloween problem and so on. How that might
actually work is far from clear, but it's a question that I have begun
to consider. As I said, a concern is that it would be in tension with
the generalized, composable syntax, where we don't explicitly have a
"special update". I'd really like to hear other opinions, though.

-- 
Peter Geoghegan

