On 31 Jan. 2017 19:29, "Michael Paquier" <michael.paqu...@gmail.com> wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding.
TL;DR: this lets us decode the xact after prepare but before commit so
decoding/replay outcomes can affect the commit-or-abort decision.
> The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
That's where you've misunderstood - it isn't committed yet. The point of
this change is to allow us to do logical decoding at the PREPARE
TRANSACTION point. The xact is not yet committed or rolled back.
This allows the results of logical decoding - or more interestingly results
of replay on another node / to another app / whatever to influence the
commit or rollback decision.
Stas wants this for a conflict-free logical semi-synchronous replication
multi master solution. At PREPARE TRANSACTION time we replay the xact to
other nodes, each of which applies it and issues PREPARE TRANSACTION, then
replies to confirm it has successfully prepared the xact. When all nodes
confirm the xact is prepared it is safe for the origin node to COMMIT
PREPARED. The other nodes then see that the first node has committed and
they commit too.
Alternately if any node replies "could not replay xact" or "could not
prepare xact" the origin node knows to ROLLBACK PREPARED. All the other
nodes see that and roll back too.
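To make the flow above concrete, here is a rough toy simulation in Python.
It is only a sketch under stated assumptions: Node, replicate_2pc and the
can_apply flag are made-up names for illustration, everything lives in
memory, and none of it is a PostgreSQL or output plugin API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a peer node (hypothetical, not a PostgreSQL API)."""
    name: str
    can_apply: bool = True  # hypothetical switch to simulate a replay failure
    prepared: set = field(default_factory=set)
    committed: set = field(default_factory=set)

    def prepare(self, gid: str) -> bool:
        # Stands in for "apply the decoded xact, then PREPARE TRANSACTION".
        if not self.can_apply:
            return False  # "could not replay xact" / "could not prepare xact"
        self.prepared.add(gid)
        return True

    def commit_prepared(self, gid: str) -> None:
        # Stands in for COMMIT PREPARED.
        self.prepared.discard(gid)
        self.committed.add(gid)

    def rollback_prepared(self, gid: str) -> None:
        # Stands in for ROLLBACK PREPARED; harmless if nothing was prepared.
        self.prepared.discard(gid)


def replicate_2pc(origin: Node, peers: list, gid: str) -> str:
    """Prepare on the origin, replay to peers, then commit or roll back."""
    if not origin.prepare(gid):
        return "rollback"
    # The xact is decoded at PREPARE time and replayed to every peer.
    votes = [peer.prepare(gid) for peer in peers]
    nodes = [origin] + list(peers)
    if all(votes):
        for node in nodes:
            node.commit_prepared(gid)
        return "commit"
    for node in nodes:
        node.rollback_prepared(gid)
    return "rollback"
```

With three nodes, replicate_2pc(a, [b, c], "tx1") commits everywhere when
every peer prepares, and rolls back everywhere as soon as one peer has
can_apply set to False. A real system would also have to cope with nodes
that crash or become unreachable between prepare and commit, which this
sketch ignores.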
This makes it possible to be much more confident that what's replicated is
exactly the same on all nodes, with no after-the-fact MM conflict
resolution that apps must be aware of to function correctly.
To really make it rock solid you also have to send the old and new values
of a row, or have row versions, or send old row hashes. Something I also
want to have, but we can mostly get that already with REPLICA IDENTITY FULL.
It is of interest to me because schema changes in MM logical replication
are more challenging, awkward and restrictive without it. Optimistic
conflict resolution doesn't work well for schema changes and once the
conflicting schema changes are committed on different nodes there is no
going back. So you need your async system to have a global locking model
for schema changes to stop conflicts arising. Or expect the user not to do
anything silly / misunderstand anything and know all the relevant system
limitations and requirements... which we all know works just great in
practice. You also need a way to ensure that schema changes don't render
committed-but-not-yet-replayed row changes from other peers nonsensical.
The safest way is a barrier where all row changes committed on any node
before committing the schema change on the origin node must be fully
replayed on every other node, making an async MM system temporarily sync
single master (and requiring all nodes to be up and reachable). Otherwise
you need a way to figure out how to conflict-resolve incoming rows with
missing columns / added columns / changed types / renamed tables etc.,
which is no fun and nearly impossible in the general case.
2PC decoding lets us avoid all this mess by sending all nodes the proposed
schema change and waiting until they all confirm successful prepare before
committing it. It can also be used to solve the row compatibility problems
with some more lazy inter-node chat in logical WAL messages.
I think the purpose of having the GID available to the decoding output
plugin at PREPARE TRANSACTION time is that it can co-operate with a global
transaction manager that way. Each node can tell the GTM "I'm ready to
commit [X]". It is IMO not crucial since you can otherwise use a (node-id,
xid) tuple, but it'd be nice for coordinating with external systems,
simplifying inter node chatter, integrating logical decoding into bigger
systems with external transaction coordinators/arbitrators etc. It seems
pretty silly _not_ to have it really.
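As a rough illustration of the two identification options, here is a small
Python sketch. txn_key and ReadyTracker are hypothetical names, and the
"node-id/xid" string format is just an assumption for the example; a real
GTM would define its own protocol.

```python
from typing import Optional


def txn_key(origin_node_id: str, origin_xid: int,
            gid: Optional[str] = None) -> str:
    """Key a node reports to the coordinator as "I'm ready to commit [X]"."""
    if gid is not None:
        # With the GID available at PREPARE time, external systems can use
        # the same identifier the application chose.
        return gid
    # Fallback: the (node-id, xid) tuple of the origin node, so that every
    # peer reports the same key for the same transaction.
    return f"{origin_node_id}/{origin_xid}"


class ReadyTracker:
    """Toy coordinator that records which nodes reported ready for a key."""

    def __init__(self, expected_nodes):
        self.expected = set(expected_nodes)
        self.ready_by_key = {}

    def ready(self, key: str, node: str) -> bool:
        # Returns True once every expected node has reported ready for key.
        reported = self.ready_by_key.setdefault(key, set())
        reported.add(node)
        return reported >= self.expected
```

Either key works for the bookkeeping; the difference is that the GID is
meaningful outside the cluster, while the (node-id, xid) form only makes
sense to nodes that know the origin's identifiers.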
Personally I don't think lack of access to the GID justifies blocking 2PC
logical decoding. It can be added separately. But it'd be nice to have,
especially if it's cheap.