On 1/8/15, 12:00 PM, Kevin Grittner wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgri...@ymail.com> wrote:
>>> Robert Haas <robertmh...@gmail.com> wrote:
>>>> Andres is talking in my other ear suggesting that we ought to
>>>> reuse the 2PC infrastructure to do all this.
>>> If you mean that the primary transaction and all FDWs in the
>>> transaction must use 2PC, that is what I was saying, although
>>> apparently not clearly enough. All nodes *including the local one*
>>> must be prepared and committed with data about the nodes saved
>>> safely off somewhere that it can be read in the event of a failure
>>> of any of the nodes *including the local one*. Without that, I see
>>> this whole approach as a train wreck just waiting to happen.
>> Clearly, all the nodes other than the local one need to use 2PC. I am
>> unconvinced that the local node must write a 2PC state file only to
>> turn around and remove it again almost immediately thereafter.
> The key point is that the distributed transaction data must be
> flagged as needing to commit rather than roll back between the
> prepare phase and the final commit. If you try to avoid the
> PREPARE, flagging, COMMIT PREPARED sequence by building the
> flagging of the distributed transaction metadata into the COMMIT
> process, you still have the problem of what to do on crash
> recovery. You really need to use 2PC to keep that clean, I think.
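
(For concreteness, the full sequence Kevin describes would look something
like this; the GIDs are made-up names:

    -- Phase 1: prepare on each remote node participating via FDW...
    PREPARE TRANSACTION 'dtx42_remote1';
    -- ...and on the local node as well:
    PREPARE TRANSACTION 'dtx42_local';
    -- Durably flag dtx42 as "commit" in the saved metadata, then:
    COMMIT PREPARED 'dtx42_local';
    -- Phase 2: commit on each remote node:
    COMMIT PREPARED 'dtx42_remote1';

Crash recovery can then consult the flag to finish, or roll back, whatever
was interrupted.)
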
If we had an independent transaction coordinator then I would agree with
you, Kevin. But I think Robert is proposing that if we control one of the
nodes that's participating as well as coordinating the overall transaction,
we can take some shortcuts. AIUI a PREPARE means you are completely ready
to commit; in essence you're just waiting to write and fsync the commit
record. That is in fact the state a coordinating PG node would be in by the
time everyone else has done their prepare. So from that standpoint we're OK.
Now, as soon as ANY of the other nodes commits, our coordinating node MUST
be able to commit as well! That would normally require it to have a real
prepared transaction of its own. However, as long as there is zero chance
of any other prepared transaction committing before our local transaction,
that step isn't actually needed: our local transaction will either commit
or abort, and that will determine what needs to happen on all other nodes.
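
(A sketch of that shortcut, again with made-up GIDs:

    -- Phase 1: prepare only the remote nodes:
    PREPARE TRANSACTION 'dtx42_remote1';  -- on each remote server
    -- The coordinating node commits normally, with no local prepare:
    COMMIT;
    -- If the local COMMIT made it to disk, then on each remote server:
    COMMIT PREPARED 'dtx42_remote1';
    -- If the local transaction aborted instead:
    ROLLBACK PREPARED 'dtx42_remote1';

The local commit record is what durably decides the outcome for all the
other nodes.)
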
I'm ignoring the question of how the local node needs to store info about the
other nodes in case of a crash, but AFAICT you could reliably recover manually
from what I just described.
I think the question is: are we OK with "going under the skirt" in this
fashion? Presumably it would provide better performance, whereas forcing
ourselves to eat our own 2PC dogfood would make it easier for someone to
plug in an external coordinator instead of using our own. I think there's
also a lot to be said for getting a partial implementation of this
available today (requiring manual recovery), so long as it's not in core.
BTW, I found
https://www.cs.rutgers.edu/~pxk/417/notes/content/transactions.html a useful
read, specifically the 2PC portion.
>>> I'm not really clear on the mechanism that is being proposed for
>>> doing this, but one way would be to have the PREPARE of the local
>>> transaction be requested explicitly and to have that cause all FDWs
>>> participating in the transaction to also be prepared. (That might
>>> be what Andres meant; I don't know.)
>> We want this to be client-transparent, so that the client just says
>> COMMIT and everything Just Works.
> What about the case where one or more nodes don't support 2PC?
> Do we silently make the choice, without the client really knowing?
We abort. (Unless we want to have a running_with_scissors GUC...)
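
(Purely for illustration -- no such GUC exists today:

    SET running_with_scissors = on;  -- hypothetical: permit participants
                                     -- that lack 2PC support

With it off by default, the safe behavior is what you get unless you
explicitly ask for trouble.)
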
>>> That doesn't strike me as the
>>> only possible mechanism to drive this, but it might well be the
>>> simplest and cleanest. The trickiest bit might be to find a good
>>> way to persist the distributed transaction information in a way
>>> that survives the failure of the main transaction -- or even the
>>> abrupt loss of the machine it's running on.
>> I'd be willing to punt on surviving a loss of the entire machine. But
>> I'd like to be able to survive an abrupt reboot.
> As long as people are aware that there is an urgent need to find
> and fix all data stores to which clusters on the failed machine
> were connected via FDW when there is a hard machine failure, I
> guess it is OK. In essence we just document it and declare it to
> be somebody else's problem. In general I would expect a
> distributed transaction manager to behave well in the face of any
> single-machine failure, but if there is one aspect of a
> full-featured distributed transaction manager we could give up, I
> guess that would be it.
ISTM that one option here would be to "simply" write and sync WAL record(s)
listing all externally prepared transactions. That would be enough for a
hot standby to find all the other servers and tell them to either commit or
abort, based on whether our local transaction committed or aborted. If you
wanted, you could even have the standby be responsible for telling all the
other participants to commit...
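
(A sketch of that recovery step, using the existing dblink extension; the
connection strings and GIDs are made up:

    CREATE EXTENSION IF NOT EXISTS dblink;
    -- Our local transaction committed, so finish each remote participant:
    SELECT dblink_exec('host=remote1 dbname=app',
                       'COMMIT PREPARED ''dtx42_remote1''');
    SELECT dblink_exec('host=remote2 dbname=app',
                       'COMMIT PREPARED ''dtx42_remote2''');
    -- Had the local commit record been absent, these would issue
    -- ROLLBACK PREPARED instead.

The WAL record above is what tells the standby which servers and GIDs to
contact.)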
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com