I am sorry, it looks like I had incorrectly interpreted Michael's comment:
Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
on others, the transaction is already partially acknowledged as
committed in the cluster. Hence it makes more sense for the
coordinator to commit transactions on the remaining nodes. Before
issuing any COMMIT PREPARED queries, I guess that's fine to rollback
the transactions on all nodes though.
I completely agree that at the second stage of 2PC, once we make the decision to commit the transaction, there is no way back: we have to complete the commit at all nodes. If a node reports that it has successfully prepared the transaction and is ready to commit, then it means the node should be able to recover it in case of failure. In my Go code example illustrating work with 2PC, I completely excluded the error-handling logic for simplicity.
Actually it should be something like this (now in C++ using pqxx):

    nontransaction srcTxn(srcCon);
    nontransaction dstTxn(dstCon);
    string transComplete;
    try {
        exec(srcTxn, "prepare transaction '%s'", gtid);
        exec(dstTxn, "prepare transaction '%s'", gtid);
        // At this moment all nodes have prepared the transaction, so in
        // principle we can make the decision to commit here...
        // But it is easier to wait until a CSN is assigned to the
        // transaction and it is delivered to all nodes
        exec(srcTxn, "select dtm_begin_prepare('%s')", gtid);
        exec(dstTxn, "select dtm_begin_prepare('%s')", gtid);
        csn = execQuery(srcTxn, "select dtm_prepare('%s', 0)", gtid);
        csn = execQuery(dstTxn, "select dtm_prepare('%s', %lu)", gtid, csn);
        exec(srcTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
        exec(dstTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
        // If we reach this point, we make the decision to commit...
        transComplete = string("commit prepared '") + gtid + "'";
    } catch (pqxx_exception const& x) {
        // ... if not, then roll back the prepared transaction
        transComplete = string("rollback prepared '") + gtid + "'";
    }
    pipeline srcPipe(srcTxn);
    pipeline dstPipe(dstTxn);
    // Send the finishing statement to both nodes in parallel
    srcPipe.insert(transComplete);
    dstPipe.insert(transComplete);
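As discussed above, once every node has acknowledged the prepare, the finishing COMMIT PREPARED should be retried on each node rather than abandoned. A minimal self-contained sketch of that retry loop (the Node struct and its execute callback are hypothetical stand-ins for the per-connection calls above, not part of pqxx):

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical node handle: execute() returns false on a transient
// failure (e.g. a network error) and may be retried.
struct Node {
    std::string name;
    std::function<bool(const std::string&)> execute;
};

// Once the commit decision is made, keep issuing COMMIT PREPARED on
// every node until it succeeds. The loop is bounded here so the sketch
// terminates; a real coordinator would keep retrying or escalate to an
// administrator, but it must never roll back at this point.
bool finishPrepared(std::vector<Node>& nodes, const std::string& gtid,
                    int maxRetries)
{
    std::string stmt = "COMMIT PREPARED '" + gtid + "'";
    for (Node& n : nodes) {
        bool done = false;
        for (int i = 0; i < maxRetries && !done; i++)
            done = n.execute(stmt);
        if (!done)
            return false;  // unresolved node: needs intervention
    }
    return true;
}
```

A false return here means the global transaction is still in doubt on some node, not that it was rolled back.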

On 11/15/2015 08:59 PM, Kevin Grittner wrote:
On Saturday, November 14, 2015 8:41 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
On 13 November 2015 at 21:35, Michael Paquier <michael.paqu...@gmail.com> wrote:
On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas <robertmh...@gmail.com> wrote:
If the database is corrupted, there's no way to guarantee that
anything works as planned.  This is like saying that criticizing
somebody's disaster recovery plan on the basis that it will be
inadequate if the entire planet earth is destroyed.

Once all participating servers return "success" from the "prepare"
phase, the only sane way to proceed is to commit as much as
possible.  (I guess you still have some room to argue about what to
do after an error on the attempt to commit the first prepared
transaction, but even there you need to move forward if it might
have been due to a communications failure preventing a success
indication for a commit which was actually successful.)  The idea
of two phase commit is that failure to commit after a successful
prepare should be extremely rare.

There could be FS, OS, or network problems as well. To come back to
the point, my point is simply that I found surprising the sentence of
Konstantin upthread saying that if commit fails on some of the nodes
we should rollback the prepared transaction on all nodes. In the
example given, in the phase after calling dtm_end_prepare, say we
perform COMMIT PREPARED correctly on node 1, but then failed it on
node 2 because a meteor has hit a server, it seems that we cannot
roll back; instead we had better roll in a backup and make sure that
the transaction gets committed. How would you roll back the transaction
already committed on node 1? But perhaps I missed something...

The usual way this works in an XA-like model is:

In phase 1 (prepare transaction, in Pg's spelling), failure on
any node triggers a rollback on all nodes.

In phase 2 (commit prepared), failure on any node causes retries
until it succeeds, or until the admin intervenes - say, to remove
that node from operation. The global xact as a whole isn't
considered successful until it's committed on all nodes.
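The two rules above can be written down as a small coordinator sketch. This is illustrative only; the Participant struct and its three callbacks are hypothetical wrappers around PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED on each node:

```cpp
#include <functional>
#include <vector>

// Hypothetical per-node operations; each returns true on success.
struct Participant {
    std::function<bool()> prepare;   // PREPARE TRANSACTION
    std::function<bool()> commit;    // COMMIT PREPARED
    std::function<bool()> rollback;  // ROLLBACK PREPARED
};

// Phase 1: a prepare failure on any node triggers rollback on all
// nodes; nothing is committed yet, so this is always safe.
// Phase 2: commit failures are retried; the global xact counts as
// successful only once COMMIT PREPARED succeeded on every node.
bool twoPhaseCommit(std::vector<Participant>& nodes, int commitRetries)
{
    for (auto& n : nodes)
        if (!n.prepare()) {
            for (auto& m : nodes)
                m.rollback();
            return false;
        }
    bool allCommitted = true;
    for (auto& n : nodes) {
        bool ok = false;
        for (int i = 0; i < commitRetries && !ok; i++)
            ok = n.commit();
        allCommitted = allCommitted && ok;  // failure: admin resolves
    }
    return allCommitted;
}
```

Note the asymmetry: rollback fans out to all nodes in phase 1, while phase 2 only ever moves forward.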

2PC and distributed commit is well studied, including the
problems. We don't have to think this up for ourselves. We don't
have to invent anything here. There's a lot of distributed
systems theory to work with - especially when dealing with well
studied relational DBs trying to maintain ACID semantics.
Right; it is silly not to build on decades of work on theory and
practice in this area.

What is making me nervous as I watch this thread is a bit of loose
talk about the I in ACID.  There have been some points where it
seemed to be implied that we had sufficient transaction isolation
if we could get all the participating nodes using the same
snapshot.  I think I've seen some hint that there is an intention
to use distributed strict two-phase locking with a global lock
manager to achieve serializable behavior.  The former is vulnerable
to some well-known serialization anomalies and the latter was
dropped from PostgreSQL when MVCC was implemented because of its
horrible concurrency and performance characteristics, which is not
going to be less of an issue in a distributed system using that
technique.  It may be possible to implement the Serializable
Snapshot Isolation (SSI) technique across multiple cooperating
nodes, but it's not obvious exactly how that would be implemented,
and I have seen no discussion of that.
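One of the well-known anomalies that snapshot isolation admits even with a perfectly consistent global snapshot is write skew. A toy simulation of the pattern (not real MVCC, just two transactions reading the same snapshot and writing disjoint rows):

```cpp
#include <map>
#include <string>
#include <utility>

// Toy write-skew demo: both "transactions" read the shared snapshot,
// check the invariant a + b >= 0, and each withdraws from a different
// account. Each write is individually legal against the snapshot, yet
// the combined result violates the invariant both transactions checked.
std::pair<int, int> runWriteSkew() {
    std::map<std::string, int> db = {{"a", 50}, {"b", 50}};
    std::map<std::string, int> snapshot = db;  // shared global snapshot

    // T1: snapshot shows a + b = 100 >= 60, so withdraw 60 from a
    if (snapshot["a"] + snapshot["b"] >= 60) db["a"] -= 60;
    // T2: same snapshot, same check passes, so withdraw 60 from b
    if (snapshot["a"] + snapshot["b"] >= 60) db["b"] -= 60;

    return {db["a"], db["b"]};  // a + b is now negative
}
```

Since neither transaction writes a row the other read-checked through its own write set, first-updater-wins conflict detection never fires; only predicate-level machinery such as SSI catches this.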

If we're going to talk about maintaining ACID semantics in this
environment, I think we need to explicitly state what level of
isolation we intend to provide, and how we intend to do that.

Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)