009_twophase.pl

Tom Lane Sun, 02 Jul 2017 15:03:06 -0700

I wrote:
> Anyway, having vented about that ... it's not very clear to me whether the
> test script is at fault for not being careful to let the slave catch up to
> the master before we promote it (and then deem the master to be usable as
> a slave without rebuilding it first), or whether we actually imagine that
> should work, in which case there's a replication logic bug here someplace.


OK, now that I can make some sense of what's going on in the 009 test
script ... it seems like that test script is presuming synchronous
replication behavior, but it's only actually set up sync rep in one
direction, namely the london->paris direction.  The failure occurs
when we lose data in the paris->london direction.  Specifically,
with the delay hack in place, I find this in the log before things
go south completely:

# Now london is master and paris is slave
ok 11 - Restore prepared transactions from files with master down
### Enabling streaming replication for node "paris"
### Starting node "paris"
# Running: pg_ctl -D 
/home/postgres/pgsql/src/test/recovery/tmp_check/data_paris_xSFF/pgdata -l 
/home/postgres/pgsql/src/test/recovery/tmp_check/log/009_twophase_paris.log 
start
waiting for server to start.... done
server started
# Postmaster PID for node "paris" is 30930
psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_009_11" does 
not exist

That ERROR is being reported by the london node, at line 267 of the
current script:
        $cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_11'");
So london is missing a prepared transaction that was created while
paris was master, a few lines earlier.  (It's not real good that
the test script isn't bothering to check the results of any of
these queries, although the end-state test I just added should close
the loop on that.)  london has no idea that it's missing data, but
when we restart the paris node a little later, it notices that its
WAL is past where london is.

I'm now inclined to think that the correct fix is to ensure that we
run synchronous rep in both directions, rather than to insert delays
to substitute for that.  Just setting synchronous_standby_names for
node paris at the top of the script doesn't work, because there's
at least one place where the script intentionally issues commands
to paris while london is stopped.  But we could turn off sync rep
for that step, perhaps.

Anyone have a different view of what to fix here?

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Race-like failure in recovery/t/009_twophase.pl

Reply via email to