On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> However, when I briefly read the description in "Transaction Management in
> the R* Distributed Database Management System (C. Mohan et al)" [2], it
> seems that what Ashutosh is saying might be a correct way to proceed after
> all:

I think Ashutosh is mostly right, but I think there's a lot of room to
doubt whether the design of this patch is good enough that we should
adopt it.

Consider two possible designs.  In design #1, the leader performs the
commit locally and then tries to send COMMIT PREPARED to every standby
server afterward, and only then acknowledges the commit to the client.
In design #2, the leader performs the commit locally and then
acknowledges the commit to the client at once, leaving the task of
running COMMIT PREPARED to some background process.  Design #2
involves a race condition, because it's possible that the background
process might not complete COMMIT PREPARED on every node before the
user submits the next query, and that query might then fail to see
supposedly-committed changes.  This can't happen in design #1.  On the
other hand, there's always the possibility that the leader's session
is forcibly killed, even perhaps by pulling the plug.  If the
background process contemplated by design #2 is well-designed, it can
recover and finish sending COMMIT PREPARED to each relevant server
after the next restart.  In design #1, that background process doesn't
necessarily exist, so inevitably there is a possibility of orphaning
prepared transactions on the remote servers, which is not good. Even
if the DBA notices them, it won't be easy to figure out whether to
commit them or roll them back.

I think this thought experiment shows that, on the one hand, there is
a point to waiting for commits on the foreign servers, because it can
avoid the anomaly of not seeing the effects of your own commits.  On
the other hand, it's ridiculous to suppose that every case can be
handled by waiting, because that just isn't true.  You can't be sure
that you'll be able to wait long enough for COMMIT PREPARED to
complete, and even if that works out, you may not want to wait
indefinitely for a dead server.  Waiting for a ROLLBACK PREPARED has
no value whatsoever unless the system design is such that failing to
wait for it results in the ROLLBACK PREPARED never getting performed
-- which is a pretty poor excuse.

Moreover, there are good reasons to think that doing this kind of
cleanup work in the post-commit hooks is never going to be acceptable.
Generally, the post-commit hooks need to be no-fail, because it's too
late to throw an ERROR.  But there's very little hope that a
connection to a remote server can be no-fail; anything that involves a
network connection is, by definition, prone to failure.  We can try to
guarantee that every single bit of code that runs in the path that
sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
ERROR, but that's going to be quite difficult to do: even palloc() can
throw an error.  And what about interrupts?  We don't want to be stuck
inside this code for a long time without any hope of the user
recovering control of the session by pressing ^C, but of course the
way that works is it throws an ERROR, which we can't handle here.  We
fixed a similar issue for synchronous replication in
9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
WARNING in that case and then return control to the user.  But if we
do that here, all of the code that every FDW emits has to be aware of
that rule and follow it, and it just adds to the list of ways that the
user backend can escape this code without having cleaned up all of the
prepared transactions on the remote side.

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation.  The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.
It attempts to complete those and retries periodically.  When a new
transaction needs this type of coordination, it adds the necessary
crash-proof state and then signals the background worker.  If
appropriate, it can wait for the background worker to complete, just
like a CHECKPOINT waits for the checkpointer to finish -- but if the
CHECKPOINT command is interrupted, the actual checkpoint is

More broadly, the question has been raised as to whether it's right to
try to handle atomic commit and atomic visibility as two separate
problems.  The XTM API proposed by Postgres Pro aims to address both
with a single stroke.  I don't think that API was well-designed, but
maybe the idea is good even if the code is not.  Generally, there are
two ways in which you could imagine that a distributed version of
PostgreSQL might work.  One possibility is that one node makes
everything work by going around and giving instructions to the other
nodes, which are more or less unaware that they are part of a cluster.
That is basically the design of Postgres-XC and certainly the design
being proposed here.  The other possibility is that the nodes are
actually clustered in some way and agree on things like whether a
transaction committed or what snapshot is current using some kind of
consensus protocol.  It is obviously possible to get a fairly long way
using the first approach but it seems likely that the second one is
fundamentally more powerful: among other things, because the first
approach is so centralized, the leader is apt to become a bottleneck.
And, quite apart from that, can a centralized architecture with the
leader manipulating the other workers ever allow for atomic
visibility?  If atomic visibility can build on top of atomic commit,
then it makes sense to do atomic commit first, but if we build this
infrastructure and then find that we need an altogether different
solution for atomic visibility, that will be unfortunate.

I know I was one of the people initially advocating this approach, but
I'm no longer convinced that it's going to work out well.  I don't
mean that we should abandon all work on this topic, or even less all
discussion, but I think we should be careful not to get so sucked into
the details of perfecting this particular patch that we ignore the
bigger design questions here.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to