On Tue, Nov 19, 2013 at 5:13 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> Ok. Which use case are you targeting during this initial effort, batch
> updates or small OLTP transactions?

OLTP transactions are probably my primary concern. I just realized
that I wasn't actually very clear on that point in my most recent
e-mail -- my apologies. What we really need for batching, and what we
should work towards in the medium term is MERGE, where a single table
scan does everything.

However, I also care about facilitating conflict resolution in
multi-master replication systems, so I think we definitely ought to
consider that carefully if at all possible. Incidentally, Andres said
a few weeks back that he thinks that what I've proposed ought to be
only exposed to C code, owing to the fact that it necessitates the
visibility trick (actually, I think UPSERT does generally, but what
I've done has, I suppose, necessitated making it more explicit/general
- i.e. modifications are added to HeapTupleSatisfiesMVCC()). I don't
understand what difference it makes to only exposed it at the C level
- what I've proposed in this area is either correct or incorrect
(Andres mentioned the Halloween problem). Furthermore, I presume that
it's broadly useful to have Bucardo-style custom conflict resolution
policies, without people having to get their hands dirty with C, and I
think having this at the SQL level helps there. Plus, as I've said
many times, the flexibility this syntax offers is likely to be broadly
useful for ordinary SQL clients - this is almost as good as SQL MERGE
for many cases.

>> Seems like an awful lot of additional mechanism.
> Not really. Once you have the code in place to do the kill-inserted-tuple
> dance on a conflict, all you need is to do an extra index search before it.
> And once you have that, it's not hard to add some kind of a heuristic to
> either do the pre-check or skip it.


> I probably shouldn't have mentioned markpos/restrpos, you're right that it's
> not a good idea to conflate that with index insertion. Nevertheless, some
> kind of an API for doing a duplicate-key check prior to insertion, and
> remembering the location for the actual insert later, seems sensible. It's
> certainly no more of a modularity violation than the value-locking scheme
> you're proposing.

I'm not so sure - in principle, any locking implementation can be used
by any conceivable amcanunique indexing method. The core system knows
that it isn't okay to sit on them all day long, but that doesn't seem
very onerous.

>> I'm certainly not opposed to making something like this work for
>> exclusion constraints. Certainly, I want this to be as general as
>> possible. But I don't think that it needs to be a blocker, and I don't
>> think we gain anything in code footprint by addressing that by being
>> as general as possible in our approach to the basic concurrency issue.
>> After all, we're going to have to repeat the basic pattern in multiple
>> modules.
> Well, I don't know what to say. I *do* have a hunch that we'd gain much in
> code footprint by making this general. I don't understand what pattern you'd
> need to repeat in multiple modules.

Now that I see this rough patch, I better appreciate what you mean. I
withdraw this objection.

> Here's a patch, implementing a rough version of the scheme I'm trying to
> explain. It's not as polished as yours, but it ought to be enough to
> evaluate the code footprint and performance. It doesn't make any changes to
> the indexam API, and it works the same with exclusion constraints and unique
> constraints. As it stands, it doesn't leave bloat behind, except when a
> concurrent insertion with a conflicting key happens between the first
> "pre-check" and the actual insertion. That should be rare in practice.
> What have you been using to performance test this?

I was just testing my patch against a custom pgbench workload,
involving running upserts against a table from a fixed range of PK
values. It's proven kind of difficult to benchmark this in the way
that pgbench has proved useful for in the past. Pretty soon the
table's PK range is "saturated", so they're all updates, but on the
other hand how do you balance the INSERT or UPDATE case?

Multiple unique indexes are the interesting case for comparing both
approaches. I didn't really worry about performance so much as
correctness, and for multiple unique constraints your approach clearly
falls down, as explained below.

>> Is it even useful to lock multiple rows if we can't really
>> update them, because they'll overlap each other when all updated with
>> the one value?
> Hmm. I think what you're referring to is the case where you try to insert a
> row so that it violates an exclusion constraint, and in a way that it
> conflicts with a large number of existing tuples. For example, if you have a
> calendar application with a constraint that two reservations must not
> overlap, and you try to insert a new reservation that covers, say, a whole
> decade.


> That's not a problem for ON DUPLICATE KEY IGNORE, as you just ignore the
> conflict and move on. For ON DUPLICATE KEY LOCK FOR UPDATE, I guess we would
> need to handle a large TID array. Or maybe we can arrange it so that the
> tuples are locked as we scan them, without having to collect them all in a
> large array.
> (the attached patch only locks the first existing tuple that conflicts; that
> needs to be fixed)

I'm having a hard time seeing how ON DUPLICATE KEY LOCK FOR UPDATE is
of very much use to exclusion constraints at all. Perhaps I lack
imagination here. However, ON DUPLICATE KEY IGNORE certainly *is*
useful with exclusion constraints, and I'm not dismissive of that.

I think we ought to at least be realistic about the concerns that
inform your approach here. I don't think that making this work for
exclusion constraints is all that compelling; I'll take it, I guess
(not that there is obviously a dichotomy between doing btree locking
and doing ECs too), but I doubt people put "overlaps" operators in the
predicates of DML very often *at all*, and doubt even more that there
is actual demand for upserting there. I think that the reason that you
prefer this design is almost entirely down to possible hazards with
btree locking around what I've done (or, indeed anything that
approximates what I've done); maybe that's so obvious that you didn't
even occur to you to mention it, but I think it should be
acknowledged. I don't think that using index locking of *some* form is
unreasonable. Certainly, I think that from reading the literature
(e.g. [1]) one can find evidence that btree page index locking as part
of value locking seems like a common technique in many popular RDBMSs,
and presumably forms an important part of their SQL MERGE
implementations. As it says in that paper:

Thus, non-leaf pages do not require locks and are protected by latches
only. The remainder of this paper focuses on locks.

They talk here about a very traditional System-R architecture -
"Assumptions about the database environment are designed to be very
traditional". Latches here are basically equivalent to our buffer
locks, and what they call locks we call heavyweight locks. So I'm
pretty sure many other *traditional* systems handle value locking by
escalating a "latch" to a leaf-page-level heavyweight lock (it's often
more granular too). I think that the advantages are fairly

I think that "4.1 Locks on keys and ranges" of this paper is interesting.

I've also found a gentler introduction to traditional btree key
locking [2]. In that paper, section "5 Protecting a B-tree’s logical
contents" it is said:

Latches must be managed carefully in key range locking if lockable
resources are defined by keys that may be deleted if not protected.
Until the lock request is inserted into the lock manager’s data
structures, the latch on the data structure in the buffer pool is
required to ensure the existence of the key value. On the other hand,
if a lock cannot be granted immediately, the thread should not hold a
latch while the transaction waits. Thus, after waiting for a key value
lock, a transaction must repeat its root-to-leaf search for the key.

So I strongly suspect that some other systems have found it useful to
escalate from a latch (buffer/page lock) to a lock (heavyweight lock).

I have some concerns about what you've done that may limit my
immediate ability to judge performance, and the relative merits of
both approaches generally. Now, I know you just wanted to sketch
something out, and that's fine, but I'm only sharing my thoughts. I am
particularly worried about the worst case (for either approach),
particularly with more than 1 unique index. I am also worried about
livelock hazards (again, in particular with more than 1 index) - I am
not asserting that they exist in your patch, but they are definitely
more difficult to reason about. Value locking works because once a
page lock is acquired, all unique indexes are inserted into. Could you
have two upserters livelock each other with two unique indexes with
1:1 correlated values in practice (i.e. 2 unique indexes that might
almost work as 1 composite index)? That is a reasonable usage of
upsert, I think.

We never wait on another transaction if there is a conflict when
inserting - we just do the usual UNIQUE_CHECK_PARTIAL thing (we don't
wait for other xact during btree insertion). This breaks the IGNORE
case (how does it determine the final outcome of the transaction that
inserted what may be a conflict, iff the conflict was only found
during insertion?), which would probably be fine for our purposes if
that were the only issue, but I have concerns about its effects on the
ON DUPLICATE KEY LOCK FOR UPDATE case too. I don't like that an
upserter's ExecInsertIndexTuples() won't wait on other xids generally,
I think. Why should the code btree-insert even though it knows it's
going to kill the heap tuple? It makes things very hard to reason

If you are just mostly thinking about exclusion constraints here, then
I'm not sure that even at this point that it's okay that the IGNORE
case doesn't work there, because IGNORE is the only thing that makes
much sense for exclusion constraints.

The unacceptable-deadlocking-pattern generally occurs when we try to
lock two different row versions. Your patch is fairly easy to make

Regarding this:

                 * At this point we have either a conflict or a potential 
conflict. If
                 * we're not supposed to raise error, just return the fact of 
                 * potential conflict without waiting to see if it's real.
                if (errorOK && !wait)
                        conflict = true;
                        if (conflictTid)
                                *conflictTid = tup->t_self;

Don't we really just have only a potential conflict? Even if
conflictTid is committed?

I think it's odd that you insert btree index tuples without ever
worrying about waiting (which is what breaks the IGNORE case, you
might say). UNIQUE_CHECK_PARTIAL never gives an xid to wait on from
within _bt_check_unique(). Won't that itself make other sessions block
pending the outcome of our transaction (in non-upserting
ExecInsertIndexTuples(), or in ExecCheckIndexConstraints())? Could
that be why your patch deadlocks unreasonable (that is, in the way
you've already agreed, in your most recent mail, isn't okay)?

Isn't it only already okay that UNIQUE_CHECK_PARTIAL might do that for
deferred unique indexes because of re-checking, which may then abort
the xact?

How will this work?:

                 * XXX: If we know or assume that there are few duplicates, it 
                 * be better to skip this, and just optimistically proceed with 
                 * insertion below. You would then leave behind some garbage 
when a
                 * conflict happens, but if it's rare, it doesn't matter much. 
                 * kind of heuristic might be in order here, like stop doing 
                 * pre-checks if the last 100 insertions have not been 

...when you consider that the only place a tid can come from is this pre-check?

Anyway, consider the following simple test-case of your patch.

postgres=# create unlogged table foo
a int4 primary key,
b int4 unique

If I run the attached pgbench script like this:

pg@hamster:~/pgbench-tools/tests$ pgbench -f upsert.sql -n -c 50 -T 20

I can get it to deadlock (and especially to throw unique constraint
violations) like crazy.  Single unique indexes seemed okay, though I
have my doubts that only allowing one unique index gets us far, or
that it will be acceptable to have the user specify a unique index in
DML or something. I discussed this with Robert in relation to his
design upthread. Multiple unique constraints were *always* the hard
case. I mean, my patch only really does something unconventional
*because* of that case, really. One unique index is easy.

Leaving discussion of value locking aside, just how rough is this
revision of yours? What do you think of certain controversial aspects
of my design that remain unchanged, such as the visibility trick (as
actually implemented, and/or just in principle)? What about the syntax
itself? It is certainly valuable to have additional MERGE-like
functionality above and beyond the basic "upsert", not least for
multi-master conflict resolution with complex resolution policies, and
this syntax gets us much of that.

How would you feel about making it possible for the UPDATE to use a
tidscan, by projecting out the tid that caused a conflict, as a
semi-documented optimization? It might be unfortunate if someone tried
to UPDATE based on that ctid twice, but that is a less common
requirement. It is kind of an abuse of notation, because of course
you're not supposed to be projecting out the conflict-causer but the
rejects, but perhaps we can live with that, if we can live with the
basic idea.

I'm sorry if my thoughts here are not fully developed, but it's hard
to pin this stuff down. Especially since I'm guessing what is and
isn't essential to your design in this rough sketch.


[1] http://zfs.informatik.rwth-aachen.de/btw2007/paper/p18.pdf

[2] http://www.hpl.hp.com/techreports/2010/HPL-2010-9.pdf
Peter Geoghegan
\set extent 10 * :scale
\setrandom rec 1 :extent

with rej as(insert into foo(a, b) values(:rec, :rec * random()) on duplicate key lock for update returning rejects *) update foo set b = rej.b from rej where foo.a = rej.a;
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to