On Mon, Nov 18, 2013 at 6:44 AM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
> I think it's important to recap the design goals of this.

Seems reasonable to list them out.

> * It should be usable and perform well for both large batch updates and
> small transactions.

I think that that's a secondary goal, a question to be considered but
perhaps deferred during this initial effort. I agree that it certainly
is important.

> * It should perform well both when there are no duplicates, and when there
> are lots of duplicates

I think this is very important.

> And from that follows some finer requirements:
>
> * Performance when there are no duplicates should be close to raw INSERT
> performance.
>
> * Performance when all rows are duplicates should be close to raw UPDATE
> performance.
>
> * We should not leave behind large numbers of dead tuples in either case.

I agree with all that.

> Anything else I'm missing?

I think so, yes. I'll add:

* Should not deadlock unreasonably.

If the UPDATE case is to work and perform almost as well as a regular
UPDATE, that must mean that it has essentially the same
characteristics as plain UPDATE. In particular, I feel fairly strongly
that it is not okay for upserts to deadlock with each other unless the
possibility of each transaction locking multiple rows (in an
inconsistent order) exists. I don't want to repeat the mistakes of
MySQL here. This is a point that I stressed to Robert on a previous
occasion [1]. It's why value locks and row locks cannot be held at the
same time. Incidentally, that implies that all alternative schemes
involving bloat will bloat once per attempt, I believe.
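
To make the standard I have in mind concrete: the only deadlocks I
consider acceptable for upserts are the ones plain UPDATE already
exhibits, i.e. multiple rows locked in an inconsistent order across
transactions. A sketch (the table name is made up):

    -- Plain UPDATEs already deadlock when two transactions lock the
    -- same rows in an inconsistent order:
    --
    -- Session 1:
    BEGIN;
    UPDATE accts SET n = n + 1 WHERE id = 1;
    -- Session 2:
    BEGIN;
    UPDATE accts SET n = n + 1 WHERE id = 2;
    -- Session 1 (blocks on session 2's row lock):
    UPDATE accts SET n = n + 1 WHERE id = 2;
    -- Session 2 (the deadlock detector aborts one of the two):
    UPDATE accts SET n = n + 1 WHERE id = 1;

An upsert that locks at most one row per conflict shouldn't add any
deadlock hazard beyond that.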

I'll also add:

* Should anticipate a day when Postgres needs plumbing for SQL MERGE,
which is still something we want, particularly for batch operations. I
realize that the standard doesn't strictly require MERGE to handle the
concurrency issues, but even so, I don't think an implementation that
doesn't handle them is practicable - does such an implementation
currently exist in any other system?
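
To be concrete about the batch case I have in mind, the standard's
MERGE looks roughly like this (table and column names are made up, and
this is only a sketch of the standard syntax, not proposed Postgres
syntax):

    -- Standard-style MERGE: per source row, update the match if there
    -- is one, otherwise insert.
    MERGE INTO target t
    USING source s
        ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET val = s.val
    WHEN NOT MATCHED THEN
        INSERT (id, val) VALUES (s.id, s.val);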

> What about exclusion constraints? I'd like to see this work for them as
> well. Currently, exclusion constraints are checked after the tuple is
> inserted, and you abort if the constraint was violated. We could still
> insert the heap and index tuples first, but instead of aborting on
> violation, we would kill the heap tuple we already inserted and retry. There
> are some complications there, like how to wake up any other backends that
> are waiting to grab a lock on the tuple we just killed, but it seems doable.

I agree that it's at least doable.
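
For context, the existing behaviour this would build on is that the
tuple is inserted first, and the command errors out when the
constraint check then fails. An illustrative sketch (names made up):

    -- Exclusion constraints are currently enforced by inserting the
    -- heap and index tuples and then checking for conflicts, aborting
    -- on violation:
    CREATE TABLE booking (
        during tsrange,
        EXCLUDE USING gist (during WITH &&)
    );
    INSERT INTO booking VALUES ('[2013-11-18 09:00, 2013-11-18 10:00)');
    INSERT INTO booking VALUES ('[2013-11-18 09:30, 2013-11-18 11:00)');
    -- ERROR:  conflicting key value violates exclusion constraint
    --         "booking_during_excl"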

> That would, however, perform badly and leave garbage behind if there are
> duplicates. A refinement of that would be to first check for constraint
> violations, then insert the tuple, and then check again. That would avoid
> the garbage in most cases, but would perform much more poorly when there are
> no duplicates, because it needs two index scans for every insertion. A
> further refinement would be to keep track of how many duplicates there have
> been recently, and switch between the two strategies based on that.

Seems like an awful lot of additional mechanism.

> That cost of doing two scans could be alleviated by using markpos/restrpos
> to do the second scan. That is presumably cheaper than starting a whole new
> scan with the same key. (markpos/restrpos don't currently work for non-MVCC
> snapshots, so that'd need to be fixed, though)

Well, regular btree index tuple insertion into unique indexes could
already use a "pick up where you left off" mechanism, and yet we don't
have one: when we block pending the outcome of another transaction
(one that inserted a duplicate that we need to conclusively know has
or has not committed), we don't resume from where we were. Today we
can only catch a committed version of the tuple that would make ours a
duplicate by starting a fresh index scan, *after* waiting for that
transaction to commit/abort at the end of our original index scan. The
fact that this mechanism doesn't already exist leaves me less than
optimistic about the prospect of making it work to facilitate a scheme
such as the one you describe here. So we're already pretty naive about
this, even though it would pay not to be.

Making something like markpos work for the purposes of an upsert
implementation seems not only hard, but also like a possible
modularity violation. Are we not unreasonably constraining the
implementation going forward? My patch respects the integrity of the
am abstraction, and doesn't really add any knowledge to the core
system about how amcanunique index methods might go about implementing
the new "amlock" method. The core system worries a bit about the "low
level locks" (as it naively refers to value locks), and doesn't
consider that it has the right to hold on to them for more than an
instant, but that's about it. Plus we don't have to worry about
whether something does or does not work for a certain snapshot type
with my approach, because as with the current unique index btree
coding, it operates at a lower level than that, and does not need to
consider visibility as such.

The markpos and restrpos am methods are only called for regular index
(and index-only) scans, which don't need to worry about things that
are not visible. Of course, upsert needs to worry about
invisible-but-conclusively-live things. This seems much harder, and
basically implies value locking of some kind, if I'm not mistaken. So
have you really gained anything?

So what I've done, aside from being, as you say below, close to
optimal, is in a sense defined in terms of existing, well-established
abstractions. I feel it's easier to reason about the implications of
holding value locks (whatever the implementation) for longer and
across multiple operations than it is to do all this instead. What
I've done with locking is scary, but not as scary as the worst case of
alternative implementations.

> And that detour with exclusion constraints takes me back to the current
> patch :-). What if you implemented the unique check in a similar fashion too
> (when doing INSERT ON DUPLICATE KEY ...)? First, scan for a conflicting key,
> and mark the position. Then do the insertion to that position. If the
> insertion fails because of a duplicate key (which was inserted after we did
> the first scan), mark the heap tuple as dead, and start over. The indexam
> changes would be quite similar to the changes you made in your patch, but
> instead of keeping the page locked, you'd only hold a pin on the target page
> (if even that). The first indexam call would check that the key doesn't
> exist, and remember the insert position. The second call would re-find the
> previous position, and insert the tuple, checking again that there really
> wasn't a duplicate key violation. The locking aspects would be less scary
> than your current patch.
>
> I'm not sure if that would perform as well as your current patch. I must
> admit your current approach is pretty optimal performance-wise. But I'd like
> to see it, and that would be a solution for exclusion constraints in any
> case.

I'm certainly not opposed to making something like this work for
exclusion constraints, and I do want this to be as general as
possible. But I don't think it needs to be a blocker, and I don't
think we save anything in code footprint by taking the most general
possible approach to the basic concurrency issue. After all, we're
going to have to repeat the basic pattern in multiple modules anyway.

With exclusion constraints, we'd have to worry about a single slot
proposed for insertion violating (and therefore presumably obliging us
to lock) every row in the table. Are we going to have a mechanism for
spilling a tid array potentially sized in gigabytes to disk (relating
to just one slot proposed for insertion)? Is it principled to have
that one slot project out rejects consisting of (say) the entire
table? Is it even useful to lock multiple rows if we can't really
update them, because they'll overlap each other when all updated with
the one value? These are complicated questions, and frankly I don't
have the bandwidth to answer them too soon. I just want to implement a
feature that there is obviously huge pent up demand for, that has in
the past put Postgres at a strategic disadvantage. I don't think it is
unsound to define ON DUPLICATE KEY in terms of unique indexes. That's
how we represent uniques...it isn't spelt ON OVERLAPPING or whatever.
That seems like an addition, a nice-to-have, and maybe not even that,
because exclusion-constrained columns *aren't* keys, and people aren't
likely to want to upsert details of a booking (the typical exclusion
constraint use-case) with the booking range in the UPDATE part's
predicate. They'd just do it by key, because they'd already have a
booking number PK value or whatever.
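
To make the concern about a single slot violating every row concrete,
here's a sketch (illustrative names again; whether we would lock,
spill, or reject in such a situation is exactly the open question):

    -- Illustrative only.  One proposed row can conflict with (and
    -- presumably oblige us to lock) every existing row in the table:
    CREATE TABLE booking (
        during tsrange,
        EXCLUDE USING gist (during WITH &&)
    );
    INSERT INTO booking
        SELECT tsrange(d, d + interval '1 hour')
        FROM generate_series('2013-01-01'::timestamp,
                             '2013-12-31', '1 day') AS d;
    -- This single row overlaps all ~365 existing rows:
    INSERT INTO booking VALUES ('[2013-01-01, 2014-01-01)');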

Making this perform as well as possible is an important consideration.
All alternative approaches that involve bloat concern me, and for
reasons that I'm not sure were fully appreciated during earlier
discussion on this thread: I'm worried about the worst case, not the
average case. I am worried about a so-called "thundering herd"
scenario. You need something like LockTuple() to arbitrate ordering,
which seems complex, and like a further modularity violation. If this
is to perform well when there are lots of existing tuples to be
updated (with contention that wouldn't be considered unreasonable for
plain updates), the amount of bloat generated by a thundering herd
could be really, really bad (once per attempt per "head of
cattle"/upserter). It's hard to say for sure how much of a problem
this is, but I think it needs to be considered. It's a problem that
I'm not sure we have the tools to analyze ahead of time. It's easier
to pin down and reason about the conventional value locking stuff,
because we know how deadlocks work. We know how to do analysis of
deadlock hazards, and the surface area actually turns out to be not
too large there.

> One limitation with your current approach is that the number of
> lwlocks you can hold simultaneously is limited (MAX_SIMUL_LWLOCKS == 100).
> Another limitation is that the minimum for shared_buffers is only 16.
> Neither of those is a serious problem in real applications - no-one runs
> with shared_buffers=16 and no sane schema has a hundred unique indexes, but
> it's still something to consider.

I was under the impression, based on previous feedback, that what
I've done with LWLocks was unlikely to be accepted. I proceeded under
the assumption that we'll be able to ameliorate these problems, for
example by implementing an alternative value locking mechanism (an
SLRU?) that is similar to what I've done to date (in particular, very
cheap and fast), but without all the downsides that concerned Robert
and Andres, and now you. As I said, I still think that's easier and
safer than all of the alternative approaches described to date. It
just so happens that I also believe it will perform a lot better in
the average case, but that isn't a key advantage to my mind.

You're right that the value locking is scary. I think we need to very
carefully consider it, once I have buy-in on the basic approach. I
really do think it's the least-worst approach described to date. It
isn't like we can't discuss making it inherently less scary, but I
hesitate to do that now, given that I don't know if that discussion
will go anywhere.

Thanks for your efforts on reviewing my work here! Do you think it
would be useful at this juncture to write a patch to make the order of
locking across unique indexes well-defined? I think it may well have
independent value: getting the insertion into unique indexes (which
can throw errors) out of the way early when doing a regular slot
insertion.
Better to abort the transaction as soon as possible.
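
For what it's worth, the scenario such a patch would nail down is
something like the following (sketch only; ordering by, say, ascending
index OID is just the obvious choice):

    -- With two unique indexes and no defined ordering, two sessions
    -- inserting the same conflicting values can each end up waiting on
    -- the other:
    CREATE TABLE t (a int UNIQUE, b int UNIQUE);
    -- Session 1 and session 2, concurrently:
    INSERT INTO t VALUES (1, 2);
    -- If session 1 handles the index on a first while session 2
    -- handles the index on b first, each blocks on the other's
    -- uncommitted conflicting entry: a deadlock.  With a well-defined
    -- order, one session simply waits and then reports the duplicate.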

[1] 
http://www.postgresql.org/message-id/CAM3SWZRfrw+zXe7CKt6-QTCuvKQ-Oi7gnbBOPqQsvddU=9m...@mail.gmail.com

-- 
Peter Geoghegan

