On Thu, Feb 8, 2018 at 3:58 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> Overall, what's the status of this patch?  Are we hung up on this
>> issue only, or are there other things?
> AFAIK there is no more technical issue in this patch so far other than
> this issue. The patch has tests and docs, and includes all stuff to
> support atomic commit to distributed transactions: the introducing
> both the atomic commit ability to distributed transactions and some
> corresponding FDW APIs, and having postgres_fdw support 2pc. I think
> this patch needs to be reviewed, especially the functionality of
> foreign transaction resolution which is re-designed before.

OK.  I'm going to give 0002 a read-through now, but I think it would
be a good thing if others also contributed to the review effort.
There is a lot of code here, and there are a lot of other patches
competing for attention.  That said, here we go:

In the documentation for pg_prepared_fdw_xacts, the first two columns
have descriptions ending in a preposition.  That's typically to be
avoided in formal writing.  The first one can be fixed by moving "in"
before "which".  The second can be fixed by changing "that" to "with
which" and then dropping the trailing "with".  The first three columns
have descriptions ending in a period; the latter two do not.  Make it
consistent with whatever the surrounding style is, or at least
internally consistent if the surrounding style varies.  Also, some of
the descriptions begin with "the" and others do not; again, seems
better to be consistent and adhere to surrounding style.  The
documentation of the serverid column seems to make no sense.  Possibly
you mean "OID of the foreign server on which this foreign transaction
is prepared"?  As it is, you use "foreign server" twice, which is why
I'm confused.

The documentation of max_prepared_foreign_transactions seems a bit
brief.  I think that if I want to be able to use 2PC for N
transactions each across K servers, this variable needs to be set to
N*K, not just N.  That's not the right way to word it here, but
somehow you probably want to explain that a single local transaction
can give rise to multiple foreign transactions and that this should be
taken into consideration when setting a value for this variable.
Maybe also include a link to where the user can find more information,
which brings me to another point: there doesn't seem to be any
general, user-facing explanation of this system.  You explain the
catalogs, the GUCs, the interface, etc. but there's nothing anywhere
that explains the overall point of the whole thing, which seems pretty
important. The closest thing you've got is a description for people
writing FDWs, but we really need a place to explain the concept to
*users*.  One idea is to add a new chapter to the "Server
Administration" section, maybe just after "Logical Replication" and
before "Regression Tests".  But I'm open to other ideas.

It's important that the documentation of the various GUCs provide
users with some clue as to how to set them.  I notice this
particularly for foreign_transaction_resolution_interval; off the top
of my head, without having read the rest of this patch, I don't know
why I shouldn't want this to be zero.  But the others could use more
explanation as well.

It is unclear from reading the documentation for GetPreparedId why
this should be the responsibility of the FDW, or how the FDW is
supposed to guarantee uniqueness.

PrepareForignTransaction is spelled wrong.  Nearby typos: prepareing,
tranasaction.  A bit further down, "can _prepare" has an unwanted
space in the middle.  Various places in this section capitalize
certain words in the middle of sentences which is not standard English
style.  For example, in "Every Foreign Data Wrapper is required..."
the word "Every" is appropriately capitalized because it begins a
sentence, but there's no reason to capitalize the others.  Likewise
for "...employs Two-phase commit protocol..." and other similar cases.

EndForeignTransaction doesn't explain what sorts of things the FDW can
legally do when called, or how this method is intended to be used.
Those seem like important details.  Especially, an FDW that tries to
do arbitrary stuff during abort cleanup will probably cause big

The fdw-transactions section of the documentation seems to imply that
henceforth every FDW must call FdwXactRegisterForeignServer, which I
think is an unacceptable API break.

It doesn't seem advisable to make this behavior depend on
max_prepared_foreign_transactions.  I think that it should be an
server-level option: use 2PC for this server, or not?  FDWs that don't
have 2PC default to "no"; but others can be set to "yes" if the user
wishes.  But we shouldn't force that decision to be on a cluster-wide

+    <xref linkend="functions-fdw-transaction-table"/> shows the functions
+    available for foreign transaction managements.


+        Resolve a foreign transaction. This function search foreign transaction

searches for

+        matching the criteria and resolve then. This function doesn't resolve

critera->arguments, resolve->resolves, doesn't->won't

+        an entry of which transaction is in-progress, or that is locked by some

a foreign transaction which is in progress, or one that is locked by some

This doesn't seem like a great API contract.  It would be nice to have
the guarantee that, if the function returns without error, all
transactions that were prepared before this function was run and which
match the given arguments are now resolved.  Skipping locked
transactions removes that guarantee.

+        This function works the same as
+        except it remove foreign transaction entry without resolving.

Explain why that's useful.

+     <entry>OID of the database that the foreign transaction resolver
process connects to</entry>

to which the ... is connected

+     <entry>Time of last resolved a foreign transaction</entry>

Time at which the process last resolved a foreign transaction

+   of foreign trasactions.

The new wait events aren't documented.

Spelling error.

+ * This module manages the transactions involving foreign servers.

Remove this.  Doesn't add any information.

+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.

This too.

+ * connection is identified by oid fo foreign server and user.

fo -> of

+ * first phase doesn not succeed for whatever reason, the foreign servers

doesn -> does

But more generally:

+ * The commit is executed in two phases. In the first phase executed during
+ * pre-commit phase, transactions are prepared on all the foreign servers,
+ * which can participate in two-phase commit protocol. Transaction on other
+ * foreign servers are committed in the same phase. In the second phase, if
+ * first phase doesn not succeed for whatever reason, the foreign servers
+ * are asked to rollback respective prepared transactions or abort the
+ * transactions if they are not prepared. This process is executed by backend
+ * process that executed the first phase. If the first phase succeeds, the
+ * backend process registers ourselves to the queue in the shared
memory and then
+ * ask the foreign transaction resolver process to resolve foreign transactions
+ * that are associated with the its transaction. After resolved all foreign
+ * transactions by foreign transaction resolve process the backend wakes up
+ * and resume to process.

The only way this can be reliable, I think, is if we prepare all of
the remote transactions before committing locally and commit them
after committing locally.  Otherwise, if we crash or fail before
committing locally, our atomic commit is no longer atomic.  I think
the way this should work is: during pre-commit, we prepare the
transaction everywhere.  After committing or rolling back, we notify
the resolver process and tell it to try to commit or roll back those
transactions.  If we ask it to commit, we also tell it to notify us
when it's done, so that we can wait (interruptibly) for it to finish,
and so that we're not trying to locally do work that might fail with
an ERROR after already committed (that would confuse the client).  If
the user hits ^C, then we handle it like synchronous replication: we
emit a WARNING saying that the transaction might not be remotely
committed yet, and return success.  I see you've got that logic in
FdwXactWaitToBeResolved() so maybe this comment isn't completely in
sync with the latest version of the patch, but I think there are some
remaining ordering problems as well; see below.

I think it is, generally, confusing to describe this process as having
two phases.  For one thing, two-phase commit has two phases, and
they're not these two phases, but we're talking about them in a patch
about two-phase commit.  But also, they really aren't two phases.
Saying something has two phases means that A happens and then B
happens.  But, at least as this is presently described here, B might
not need to happen at all.  So that's not really a phase.  I think you
need a different word here, like maybe "ways", unless I'm just
misunderstanding what this whole thing is saying.

+ *    * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions()
+ *      have been modified to go through fdw_xact->inredo entries that have
+ *      not made to disk yet.

This doesn't seem to be true.  I see no reference to these functions
being modified elsewhere in the patch.  Nor is it clear why they would
need to be modified.  For local 2PC, prepared transactions need to be
included in all snapshots that are taken.  Otherwise, if a local 2PC
transaction were committed, concurrent transactions would see the
effects of that transaction appear all at once, even though they
hadn't gotten a new snapshot.  That is the reason why we need
StandbyRecoverPreparedTransactions() before opening for hot standby.
But for FDW 2PC, even if we knew which foreign transactions were
prepared but not yet committed, we have no mechanism for preventing
those changes from being visible on the remote servers, nor do they
have any effect on local visibility.  So there's no need for this
AFAICS.  Similarly, this appears to me to be incorrect:

+        RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+                                     max_prepared_foreign_xacts,
+                                     ControlFile->max_prepared_foreign_xacts);

I might be confused here, but it seems to me that the value of
max_prepared_foreign_transactions is irrelevant to whether we can
initialize Hot Standby, because, again, those remote xacts have no
effect on our local snapshot.  Rather, the problem is that we are
*altogether unable to proceed with recovery* if this value is too low,
regardless of whether we are doing Hot Standby or not.  twophase.c
handles that by just erroring out in PrepareRedoAdd() if we run out of
slots, and insert_fdw_xact does the same thing (although that message
doesn't follow style guidelines -- no space before a colon, please!).
So it seems to me that you can just delete this code and the
associated documentation mention; these concerns are irrelevant here
and the actual failure case is otherwise handled.

+ *      We save them to disk and alos set fdw_xact->ondisk to true.
+ *    * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions()
+                     errmsg("prepread foreign transactions are disabled"),
+                 errmsg("out of foreign trasanction resolver slots"),

More typos: alos, RecoverPreparedTrasactions, prepread, trasanction

+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include "postgres.h"

Thou shalt not have any #includes before "postgres.h"

+#include "miscadmin.h"
+#include "funcapi.h"

These should be in the main alphabetized list.  If that doesn't work,
then some header is broken.

+        if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+        {
+            fdw_conn->modified |= modify;

I suggest avoiding |= with Boolean.  It might be harmless, but it
would be just as easy to write this as if (existing conditions &&
!fdw_conn->modified) fdw_conn->modified = true, which avoids any
assumption about the bit patterns.

+        max_hash_size = max_prepared_foreign_xacts;
+        init_hash_size = max_hash_size / 2;

I think we decided that hash tables other than the lock manager should
initialize with maximum size = initial size.  See

+    if (list_length(MyFdwConnections) < 1)

How about if (MyFdwConnections) == NIL?

This occurs multiple times in the patch.

+    if (list_length(MyFdwConnections) == 0)

How about if (MyFdwConnections) == NIL?

+    if ((list_length(MyFdwConnections) > 1) ||
+        (list_length(MyFdwConnections) == 1 && (MyXactFlags &
+        return true;

I think this would be clearer written this way:

int nserverswritten = list_length(MyFdwConnections);
return nserverswritten > 1;

But that brings up another issue: why is MyFdwConnections named that
way and why does it have those semantics?  That is, why do we really
need a list of every FDW connection?  I think we really only need the
ones that are 2PC-capable writes.  If a non-2PC-capable foreign server
is involved in the transaction, then we don't really to keep it in a
list.  We just need to flag the transaction as not locally prepareable
i.e. clear TwoPhaseReady.  I think that this could all be figured out
in a much less onerous way: if we ever perform a modification of a
foreign table, have nodeModifyTable.c either mark the transaction
non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
server is not 2PC capable, or otherwise add the appropriate
information to MyFdwConnections, which can then be renamed to indicate
that it contains only information about preparable stuff.  Then you
don't need each individual FDW to be responsible about calling some
new function; the core code just does the right thing automatically.

+        if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+                                        fdw_conn->umid, true))
+            elog(WARNING, "could not commit transaction on server %s",
+                 fdw_conn->servername);

First, as I noted above, this assumes that the local transaction can't
fail after this point, which is certainly false -- if nothing else,
think about a plug pull.  Second, it assumes that the right thing to
do would be to throw a WARNING, which I think is false: if this is
running in the pre-commit phase, it's not too late to switch to the
abort path, and certainly if we make changes on only 1 server and the
commit on that server fails, we should be rolling back, not committing
with a warning.  Third, if we did need to restrict ourselves to
warnings here, that's probably impractical.  This function needs to do
network I/O, which is not a no-fail operation, and even if it works,
the remote side can fail in any arbitrary way.

+FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,

This is not a very clear function name.  Apparently, the thing that is
doing the registration is an FdwXact, and the thing being registered
is also an FdwXact.

+        /*
+         * Between FdwXactRegisterFdwXact call above till this
backend hears back
+         * from foreign server, the backend may abort the local transaction
+         * (say, because of a signal). During abort processing, it will send
+         * an ABORT message to the foreign server. If the foreign server has
+         * not prepared the transaction, the message will succeed. If the
+         * foreign server has prepared transaction, it will throw an error,
+         * which we will ignore and the prepared foreign transaction will be
+         * resolved by a foreign transaction resolver.
+         */
+        if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid,
+                                            fdw_conn->umid, fdw_xact_id))
+        {

Again, I think this is an impractical API contract.  It assumes that
prepare_foreign_xact can't throw errors, which is likely to make for a
lot of implementation problems on the FDW side -- you can't even
palloc.  You can't call any existing code you've already got that
might throw an error for any reason.  You definitely can't do a
syscache lookup.  You can't accept interrupts, even though you're
doing network I/O that could hang.  The only reason to structure it
this way is to avoid having a transaction that we think is prepared in
our local bookkeeping when, on the remote side, it really isn't.  But
that is an unavoidable problem, because the whole system could crash
after the remote prepare has been done and before prepare_foreign_xact
even returns control.  Or we could fail after XLogInsert() and before

AtEOXact_FdwXacts has the same problem.

I think you can fix make this a lot cleaner if you make it part of the
API contract that resolver shouldn't fail when asked to roll back a
transaction that was never prepared.  Then it can work like this: just
before we prepare the remote transaction, we note the information in
shared memory and in the WAL.  If that fails, we just abort.  When the
resolver sees that our xact is no longer running and did not commit,
it concludes that we must have failed and tries to roll back all the
remotely-prepared xacts.  If they don't exist, then the PREPARE
failed; if they do, then we roll 'em back.  On the other hand, if all
of the prepares succeed, then we commit.  Seeing that our XID is no
longer running and did commit, the resolver tries to commit those
remotely-prepared xacts.  In the commit case, we log a complaint if we
don't find the xact, but in the abort case, it's got to be an expected
outcome.  If we really wanted to get paranoid about this, we could log
a WAL record after finishing all the prepares -- or even each prepare
-- saying "hey, after this point the remote xact should definitely
exist!".  And then the resolver could complaints a nonexistent remote
xact when rolling back only if that record hasn't been issued yet.
But to me that looks like overkill; the window between issuing that
record and issuing the actual commit record would be very narrow, and
in most cases they would be flushed together anyway.  We could of
course force the new record to be separately flushed first, but that's
just making everything slower for no gain.

Note that this requires that we not truncate away portions of clog
that contain commit status information about no-longer-running
transactions that have unresolved FDW 2PC xacts, or at least that we
issue a WAL record updating the state of the fdw_xact so that it
doesn't refer to that portion of clog any more - e.g. by setting the
XID to either FrozenTransactionId or InvalidTransactionId, though that
would be a problem since it would break the unique-XID assumption.   I
don't see the patch doing either of those things right now, although I
may have missed it.  Note that here again the differences between
local 2PC and FDW 2PC really make a difference.  Local 2PC has a
PGPROC+PGXACT, so the regular snaphot-taking code suffices to prevent
clog truncation, because the relevant XIDs are showing up in
snapshots.  The PrescanPreparedTransactions stuff only needs to nail
things down until we reach consistency, and then the regular
mechanisms take over.  We have no such easy out for this system.

+    if (unlink(path))
+        if (errno != ENOENT || giveWarning)

Poor style.  Use &&, not nested ifs.  Oh, I guess you copied this from
the twophase.c code; but let's fix it anyway.

It's not exactly clear to me what the point of "locked" FdwXacts is,
but whatever that point may be, how can remove_fdw_xact() get away
with summarily releasing a lock that may be held by some other
process?  If we're the process that has the FdwXact locked, then we'll
delete it from MyLockedFdwXacts, but if some other process has it
locked, nothing will happen.  If that's safe for some non-obvious
reason, it at least needs a comment.

I think this whole function could be written with a lot less nesting
if you first write a loop to find the appropriate value for cnt, then
error out if we end up with cnt >= FdwXactCtl->numFdwXacts, and then
finally do all of the stuff that happens once we identify a match.
That saves two levels of indentation for most of the function.

The delayCkpt interlocking which twophase.c uses is absent here.
That's important, because otherwise a checkpoint can happen between
the time we write to WAL and the time we actually perform the on-disk
operations.  If a crash happens just before the operation is actually
performed, then it never happens on the master but still happens on
the standbys.  Oops.

+FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+    FdwXact    fdw_xact;
+    Assert(RecoveryInProgress());
+    fdw_xact = get_fdw_xact(xid, serverid, userid);
+    if (fdw_xact)
+    {
+        /* Now we can clean up any files we already left */
+        Assert(fdw_xact->inredo);
+        remove_fdw_xact(fdw_xact);
+    }
+    else
+    {
+        /*
+         * Entry could be on disk. Call with giveWarning = false
+         * since it can be expected during replay.
+         */
+        RemoveFdwXactFile(xid, serverid, userid, false);
+    }

I hope this won't sound too harsh, but he phrase that comes to mind
here is "spaghetti code".  First, we look for a matching FdwXact in
shared memory and, if we find one, do all of the cleanup inside
remove_fdw_xact() which also removes it from the disk.  Otherwise, we
try to remove it from disk anyway.  if (condition) do_a_and_b(); else
do_b(); is not generally a good way to structure code. Moreover, it's
not clear why we should be doing it like this in the first place.
There's no similar logic in twophase.c; PrepareRedoRemove does nothing
on disk if the state isn't found in memory.  The fact that you've got
this set up this way suggests that you haven't structured things so as
to guarantee that the in-memory state is always accurate.  If so, that
should be fixed; if not, this code isn't needed anyway.

+    if (!fdwXactExitRegistered)
+    {
+        before_shmem_exit(AtProcExit_FdwXact, 0);
+        fdwXactExitRegistered = true;
+    }

Sadly, this code has a latent hazard.  If somebody ever calls this
from inside a PG_ENSURE_ERROR_CLEANUP() block, then they can end up
failing to unregister their handler, because of the limitations
described in before_shmem_cleanup()'s handler.  It's better to do this
in FdwXactShmemInit().

+    if (list_length(entries_to_resolve) == 0)

Here again, just test against NIL.

fdwxact_resolver.c is very light on comments.

+void FdwXactRslvLoop(void)

Not project style.  There are other, similar instances.

+    need_twophase = TwoPhaseCommitRequired();

I think this nomenclature is going to cause confusion.  We need to
distinguish somehow between using remote 2PC for foreign transactions
and using local 2PC.  The TwoPhase naming is already well-established
as referring to the latter, so I think we should name this some other

+    if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+    {
+        Form_pg_foreign_server srvForm = (Form_pg_foreign_server)
+        ereport(ERROR,
+                (errmsg("server \"%s\" has unresolved prepared
transactions on it",
+                        NameStr(srvForm->srvname))));
+    }

I think if this happens, it would be more appropriate to just issue a
WARNING and forget about those transactions.  Blocking DROP is not
nice, and shouldn't be done without a really good reason.

+                (errmsg("preparing foreign transactions
(max_prepared_foreign_transactions > 0) requires
maX_foreign_xact_resolvers > 0")));

Bogus capitalization.

+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

I am very much not impressed by this uncommented macro definition.
You can probably guess the reason.  :-)

+            ereport(WARNING, (errmsg("could not resolve dangling
foreign transaction for xid %u, foreign server %u and user %d",
+                                     fdwxact->local_xid,
fdwxact->serverid, fdwxact->userid)));

Formatting is wrong.

My ability to concentrate on this patch is just about exhausted for
today so I think I'm going to have to stop here.  But in general I
would say this patch still needs a lot of work.  As noted above, the
concurrency, crash-safety, and error-handing issues don't seem to have
been thought through carefully enough, and there are even a fairly
large number of trivial spelling errors and coding and/or message
style violations.  Comments are lacking in some places where they are
clearly needed.  There seems to be a fair amount of work needed to
ensure that each thing has exactly one name: not two, and not a shared
name with something else, and that all of those names are clear.
There are a few TODO items remaining in the code.  I think that it is
going to take a significant effort to get all of this cleaned up.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Reply via email to