On Fri, Nov 13, 2015 at 8:35 AM, Michael Paquier
> As well as there could be FS, OS, network problems... To come back to
> the point, my point is simply that I found surprising the sentence of
> Konstantin upthread saying that if commit fails on some of the nodes
> we should rollback the prepared transaction on all nodes. In the
> example given, in the phase after calling dtm_end_prepare, say we
> perform COMMIT PREPARED correctly on node 1, but then failed it on
> node 2 because a meteor has hit a server, it seems that we cannot
> rollback, instead we had better rolling in a backup and be sure that
> the transaction gets committed. How would you rollback the transaction
> already committed on node 1? But perhaps I missed something...
Right. In that case, we have to try to eventually get it committed everywhere.
One thing that's a bit confusing about this XTM interface is what
"COMMIT" actually means. The idea is that on the standby server we
will call some DTM-provided function and pass it a token. Then we
will begin and commit a transaction. But presumably the commit does
not actually commit, because if it's a single transaction on all nodes
then the commit can't be completed until all work is done all nodes.
So my guess is that the COMMIT here is intended to behave more like a
PREPARE, but this is not made explicit.
>> One point I'd like to mention is that it's absolutely critical to
>> design this in a way that minimizes network roundtrips without
>> compromising correctness. XC's GTM proxy suggests that they failed to
>> do that. I think we really need to look at what's going to be on the
>> other sides of the proposed APIs and think about whether it's going to
>> be possible to have a strong local caching layer that keeps network
>> roundtrips to a minimum. We should consider whether the need for such
>> a caching layer has any impact on what the APIs should look like.
> At this time, the number of round trips needed particularly for READ
> COMMITTED transactions that need a new snapshot for each query was
> really a performance killer. We used DBT-1 (TPC-W) which is less
> OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
> was quickly reached with 10-20 nodes..
Yeah. I think this merits a good bit of thought. Superficially, at
least, it seems that every time you need a snapshot - which in the
case of READ COMMITTED is for every SQL statement - you need a network
roundtrip to the snapshot server. If multiple backends request a
snapshot in very quick succession, you might be able to do a sort of
"group commit" thing where you send a single request to the server and
they all use the resulting snapshot, but it seems hard to get very far
with such optimization. For example, if backend 1 sends a snapshot
request and backend 2 then realizes that it also needs a snapshot, it
can't just wait for the reply from backend 1 and use that one. The
user might have committed a transaction someplace else and then kicked
off a transaction on backend 2 afterward, expecting it to see the work
committed earlier. But the snapshot returned to backend 1 might have
been taken before that. So, all in all, this seems rather crippling.
Things are better if the system has a single coordinator node that is
also the arbiter of commits and snapshots. Then, it can always take a
snapshot locally with no network roundtrip, and when it reaches out to
a shard, it can pass along the snapshot information with the SQL query
(or query plan) it has to send anyway. But then the single
coordinator seems like it becomes a bottleneck. As soon as you have
multiple coordinators, one of them has got to be the arbiter of global
ordering, and now all of the other coordinators have to talk to it
Maybe I'm thinking of this too narrowly by talking about snapshots;
perhaps there are other ways of ensuring whatever level of transaction
isolation we want to have here. But I'm not sure it matters that much
- I don't see any way for the sees-the-effects-of relation on the set
of all transactions to be a total ordering without some kind of
central arbiter of the commit ordering. Except for
perfectly-synchronized timestamps, but I don't think that's really
physically possible anyway.
The Enterprise PostgreSQL Company
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: