Hi,

This is awesome.

On 2013-08-30 15:09:59 -0700, Peter Geoghegan wrote:
> 1) The patch is only interested in unique index violations
> (specifically, violations of amcanunique access method unique
> indexes); it will not do anything with NOT NULL constraint violations, as
> the MySQL feature does, for example. It also doesn't care about
> exclusion constraints, even though it's currently possible to defer
> exclusion constraint enforcement in a way that is analogous to how
> it's done for btree indexes.

All that seems sane to me. I very, very much do not want it to deal with
NOT NULL violations.

> [syntax]

Haven't thought about syntax at all so far; that's food for another day.

> Use-cases
> =========
> ...
> In addition to that, the feature may be of value to clients of the
> logical change-set generation infrastructure (such as the proposed BDR
> contrib module[3]), if and when that infrastructure is finally
> committed. Currently, the BDR prototype simply balks at INSERT
> conflicts. This is because the amount of subtransactions/xids
> generated to do anything else is considered prohibitive. If Postgres
> is ever going to solve the kinds of problems that BDR proposes to
> solve well (except for the relatively narrow use-case where balking at
> INSERT conflicts is acceptable), it needs something like what I've
> proposed here. The only alternative is probably to replay a
> transaction without using subtransactions, and if that fails, remember
> that fact and wrap every single change in a subxact. This is really
> going to be painful for large transactions.

Exactly. In completely unscientific tests, wrapping every change in a
subxact reduces speed to less than 1/10 of the version without subxacts,
and that's without a single conflict.
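For illustration, that per-change subxact wrapping amounts to something
like this at the SQL level (just a sketch, assuming a two-column table
like the foo example further down):

BEGIN;
SAVEPOINT change_1;              -- one subxact, i.e. one extra xid, per change
INSERT INTO foo VALUES (1, 1);
RELEASE SAVEPOINT change_1;      -- no conflict: keep it
SAVEPOINT change_2;
INSERT INTO foo VALUES (2, 2);
-- on a unique_violation here: ROLLBACK TO SAVEPOINT change_2, resolve, continue
RELEASE SAVEPOINT change_2;
COMMIT;

It's that per-change subxact/xid overhead that hurts, conflict or no
conflict.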

> Andres privately said some weeks back (prior to my sharing of this
> information, and even prior to writing most of the code posted here)
> that he'd like to see an INSERT...ON DUPLICATE KEY LOCK. I suppose
> that he must have meant that he'd like to see shared locks obtained on
> rows that blocked an inserter from actually inserting some of their
> proposed tuples, for the purposes of conflict resolution (which, as
> I've said, this doesn't do). Perhaps he'd like to comment on these
> issues here.

Ok, so what we would like to do is basically to follow a (simplified)
protocol like this:

1) replay INSERT, using ON DUPLICATE IGNORE
2) if INSERT succeeded, be happy, no conflict (another INSERT with the
   same key might conflict though)
3) if the INSERT failed, lock the existing tuple FOR UPDATE
4) if the tuple got deleted by another session, goto 1
5) check whether the local tuple or the already inserted tuple wins conflict
   resolution
6) UPDATE or skip accordingly

If there's a variant of INSERT ... ON DUPLICATE that takes a FOR UPDATE
lock on the existing row, this gets simpler, because there's no longer any
chance that we need to loop or look for other versions of the row.
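In pseudo-SQL the per-INSERT apply logic would then look roughly like the
following. This is a sketch only: it assumes the proposed ON DUPLICATE KEY
IGNORE syntax, a table tbl(id int UNIQUE, val int), a hypothetical
remote_wins() conflict-resolution function, and a PL/pgSQL-style block
with the obvious declarations.

LOOP
    -- 1) replay the INSERT, ignoring duplicates (proposed syntax)
    INSERT INTO tbl VALUES (remote_id, remote_val) ON DUPLICATE KEY IGNORE;
    IF FOUND THEN
        EXIT;                    -- 2) inserted, no conflict
    END IF;
    -- 3) lock the existing row
    SELECT * INTO existing FROM tbl WHERE id = remote_id FOR UPDATE;
    IF NOT FOUND THEN
        CONTINUE;                -- 4) it got deleted underneath us, retry
    END IF;
    -- 5) decide which tuple wins, 6) apply or skip
    IF remote_wins(existing, remote_id, remote_val) THEN
        UPDATE tbl SET val = remote_val WHERE id = remote_id;
    END IF;
    EXIT;
END LOOP;

With the FOR UPDATE-locking variant, the SELECT ... FOR UPDATE and the
retry loop go away.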

Makes sense?

> Mechanism
> =========
>
> [...]
>
> From a high level, the patch works by adding something that is
> tentatively internally referred to as "speculative insertion" in
> lower-level code comments. The basic approach is:
>
> * Within ExecInsert, when inserting into a regular heap relation from
> a tuple table slot (after BEFORE triggers fire but before inserting
> the heap tuple), there is an extra function call to a new function,
> ExecLockIndexTuples().
> 
> * For each applicable amcanunique unique index, ExecLockIndexTuples()
> calls a new am method, amlock. For each unique index, we end up
> acquiring a write buffer lock in a similar fashion to _bt_doinsert.
> When we return from the new amlock function, we report whether or not
> there was a duplicate without having inserted any index tuples or
> having otherwise done something that requires a heap tuple to be
> present. I like to call this point "the point of no return", where
> individual unique indexes commit themselves to insertion (which is to
> say, the point after which a unique constraint violation becomes
> impossible but before actual index tuple insertion in _bt_doinsert
> [9]). Another duplicate key cannot be inserted by anyone else until
> that write lock is released, which it won't be until insertion or
> IGNORE, for the same reason why that's currently true once control
> reaches the point of no return. A pagesplit might need to happen
> before insertion proper of that index tuple, for example, and that's
> okay.
> 
> * If we have total consensus among all unique indexes, we proceed with
> a heap insertion of the tuple formed from the tuple table slot
> ExecInsert() was called in respect of. Yes, this means that
> potentially multiple exclusive buffer locks are held for the duration
> of the heap tuple insertion. If not, we release the locks early and
> return NULL from ExecInsert() before heap insertion ever happens (and
> certainly before index tuple insertion ever happens).
> 
> * If and when we get to ExecInsertIndexTuples(), and ultimately to
> _bt_doinsert, the new _bt_doinsert knows that for any unique indexes
> it sees, that the process has already been partially completed. It
> takes state made available from the first phase of speculative
> insertion, and finishes the job, pretty much picking up from the point
> of no return for amcanunique unique indexes (or starting from scratch
> in the usual way for non-unique amcanunique-method indexes that happen
> to be involved in speculative insertion).

Phew. It took me some time to even start to understand what you're doing
here...

While I ponder it some more, could you tell me whether I've understood
something correctly? Consider the following scenario:

CREATE TABLE foo(a int UNIQUE, b int UNIQUE);

S1: INSERT INTO foo VALUES (0, 0);
S1: BEGIN;
S1: INSERT INTO foo VALUES (1, 1);
S1: SELECT pg_sleep(3600);
S2: BEGIN;
S2: INSERT INTO foo VALUES (2, 1);
S3: SELECT * FROM foo WHERE a = 0;

Won't S2 in this case exclusively lock a page in foo_a_key (the one where
2 will reside) for 3600s, while it's waiting for S1 to finish during the
speculative insertion into foo_b_key?

ISTM we could use something like a new lock level to make that work more
sensibly, basically something like LW_FOR_UPDATE which does not conflict
with _SHARED but conflicts with _EXCLUSIVE. This can then be used by the
speculative insert to "reserve" that page.
But while that would allow concurrent searches, this algorithm would
still prevent any insertion on the same page for 3600s, right?
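To make that last point concrete, continuing the scenario above and
assuming the key 3 sorts onto the same foo_a_key page as 2, a fourth
session would get stuck as well, despite conflicting with nothing:

S4: INSERT INTO foo VALUES (3, 3);  -- blocks behind S2's page reservation
                                    -- until S1 finishes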

> Deadlock hazards
> ==============
> 
> In order to avoid deadlocks, the locking stage acquires locks in
> reverse order to the usual index insertion order. So, regardless of
> how those locks are subsequently freed (either by actual insertion or
> by finishing early in the IGNORE case), we avoid buffer lock related
> deadlocks. Still, deadlock hazards are something that reviewers will
> want to pay particular attention to.
> 
> Unlike some other systems like DB2, we have always allowed BEFORE ROW
> triggers to execute arbitrary SQL. I've frequently thought this was a
> bit of a wart (e.g. [10]), and certainly not supportive of sensible,
> idiomatic trigger use, but there isn't much we can do about it at this
> stage. Right now, BEFORE ROW triggers will still fire when the new
> code decides to not do the insert. It certainly wouldn't be acceptable
> to allow before triggers to run *after* the first phase of speculative
> insertion, because they might try and access an index with an
> exclusive locked buffer, resulting in backend self-deadlock. Besides,
> we cannot really know *what* to lock until after the before triggers
> fire. That leaves us with two options, as far as I can tell:
> 
> 1. Just have users live with those semantics. This is what the patch
> does presently, and anyone who is using triggers in what I consider to
> be a sensible way doesn't have to care. For everyone else, it's a
> gotcha that they have to deal with, to be noted prominently.

I find that acceptable. They already need to deal with the fact that a
second trigger can stop insertion.
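To spell the gotcha out with a sketch (again using the proposed ON
DUPLICATE KEY IGNORE syntax): the BEFORE ROW trigger's side effect sticks
around even though the second INSERT inserts nothing.

CREATE TABLE audited (a int UNIQUE);
CREATE TABLE audit (msg text);

CREATE FUNCTION audited_bi() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO audit VALUES ('inserting ' || NEW.a);  -- side effect fires either way
    RETURN NEW;
END $$;

CREATE TRIGGER audited_before_insert BEFORE INSERT ON audited
    FOR EACH ROW EXECUTE PROCEDURE audited_bi();

INSERT INTO audited VALUES (1);
INSERT INTO audited VALUES (1) ON DUPLICATE KEY IGNORE;  -- row is skipped ...
SELECT count(*) FROM audit;                              -- ... but this still shows 2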

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

