I wrote:
> Anyway, having vented about that ... it's not very clear to me whether the
> test script is at fault for not being careful to let the slave catch up to
> the master before we promote it (and then deem the master to be usable as
> a slave without rebuilding it first), or whether we actually imagine that
> should work, in which case there's a replication logic bug here someplace.
OK, now that I can make some sense of what's going on in the 009 test
script ... it seems like that test script is presuming synchronous
replication behavior, but it's only actually set up sync rep in one
direction, namely the london->paris direction. The failure occurs
when we lose data in the paris->london direction. Specifically,
with the delay hack in place, I find this in the log before things
go south completely:

# Now london is master and paris is slave
ok 11 - Restore prepared transactions from files with master down
### Enabling streaming replication for node "paris"
### Starting node "paris"
# Running: pg_ctl -D /home/postgres/pgsql/src/test/recovery/tmp_check/data_paris_xSFF/pgdata -l /home/postgres/pgsql/src/test/recovery/tmp_check/log/009_twophase_paris.log start
waiting for server to start.... done
server started
# Postmaster PID for node "paris" is 30930
psql:<stdin>:1: ERROR: prepared transaction with identifier "xact_009_11" does not exist

That ERROR is being reported by the london node, at line 267 of the
current script:

	$cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_11'");

So london is missing a prepared transaction that was created while
paris was master, a few lines earlier. (It's not real good that the
test script isn't bothering to check the results of any of these
queries, although the end-state test I just added should close the
loop on that.) london has no idea that it's missing data, but when
we restart the paris node a little later, it notices that its WAL is
past where london is.

I'm now inclined to think that the correct fix is to ensure that we
run synchronous rep in both directions, rather than to insert delays
to substitute for that. Just setting synchronous_standby_names for
node paris at the top of the script doesn't work, because there's
at least one place where the script intentionally issues commands
to paris while london is stopped. But we could turn off sync rep
for that step, perhaps.
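
For concreteness, here is a minimal sketch of the settings that would
be involved, as run against the paris node. It assumes that london
connects to paris with application_name "london", which is not
confirmed above. Using ALTER SYSTEM plus a reload is only one possible
way to flip the setting; appending to postgresql.conf and reloading
would do as well.

```sql
-- On paris, at the top of the script: make commits wait for london.
-- Assumption: london's walreceiver reports application_name = 'london'.
ALTER SYSTEM SET synchronous_standby_names = 'london';
SELECT pg_reload_conf();

-- On paris, just before the step that issues commands while london is
-- stopped: turn sync rep off so commits do not wait for an absent standby.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();

-- On paris, once london is back up and streaming: turn sync rep back on.
ALTER SYSTEM SET synchronous_standby_names = 'london';
SELECT pg_reload_conf();
```

synchronous_standby_names takes effect on a reload, without a server
restart, so the script could flip it between steps this way.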
Anyone have a different view of what to fix here?

			regards, tom lane