Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests

Michael Paquier Mon, 01 Aug 2016 18:28:44 -0700

On Wed, Jul 27, 2016 at 10:00 AM, Michael Paquier
<[email protected]> wrote:
> On Mon, Jul 25, 2016 at 10:05 PM, Michael Paquier
> <[email protected]> wrote:
>> On Mon, Jul 25, 2016 at 2:52 PM, Michael Paquier
>> <[email protected]> wrote:
>>> Ah, yes, and that's a stupid mistake. We had better use
>>> replay_location instead of write_location. There is a risk that
>>> records have not been replayed yet even if they have been written on
>>> the standby, so it is possible that the query looking at tab_int may
>>> not see this relation.
>>
>> Or in short, the attached fixes 2) and will help providing input for 1)..
>
> Increasing the timeout had zero effect for test 003:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-07-26%2016%3A00%3A07
> So we're facing something else. I'll dig into that deeper using
> manually hamster.


And so I have been finally able to reproduce this one, and this is a
timing issue in the test.

First see the failure itself:
LOG:  recovery stopping after commit of transaction 548, time
2016-08-01 21:14:37.647104+09
LOG:  recovery has paused
HINT:  Execute pg_xlog_replay_resume() to continue.
LOG:  statement: SELECT '0/30122D0'::pg_lsn <= pg_last_xlog_replay_location()
[keeps waiting for this LSN to be replayed]

But by looking at the WAL records of this failed test I could see the following
rmgr: Transaction len (rec/tot):      8/    34, tx:        547, lsn:
0/03012248, prev 0/03012208, desc: COMMIT 2016-08-01 21:14:37.514805
JST
rmgr: Transaction len (rec/tot):      8/    34, tx:        548, lsn:
0/03012270, prev 0/03012248, desc: COMMIT 2016-08-01 21:14:37.647104
JST
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn:
0/03012298, prev 0/03012270, desc: RUNNING_XACTS nextXid 549
latestCompletedXid 548 oldestRunningXid 549
rmgr: Heap        len (rec/tot):      3/    59, tx:        549, lsn:
0/030122D0, prev 0/03012298, desc: INSERT off 193, blkref #0: rel
1663/12404/16384 blk 8
It is not a surprise if this keeps waiting forever. As the recovery is
paused before it cannot replay the wanted LSN.

Here using pg_xlog_replay_resume() is not the correct solution because
this would cause the node to finish recovery before we want it to, and
so is recovery_target_action = 'promote'. If we look at the test, it
is doing the following when getting the TXID that is used as recovery
target:
$node_master->safe_psql('postgres',
        "INSERT INTO tab_int VALUES (generate_series(1001,2000))");
my $recovery_txid =
  $node_master->safe_psql('postgres', "SELECT txid_current()");
my $lsn2 =
  $node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
What I think we had better do is reverse the calls
pg_current_xlog_location() and txid_current() so as we are sure that
the LSN we track for replay is lower than the real target LSN. The
same problem exists when defining the timestamp target.

The patch attached does that, and it fixes as well the issue with test
001 regarding write_location that should not be used. With this patch
I have let the test 003 run for two hours in a row and it did not
complain. Previously I was able to see a failure at the 3rd~4th
attempts, within 30 minutes.

There is still an issue with pg_basebackup when testing stream mode
and replication slots. I am digging into this one now..
-- 
Michael

recovery-test-fixes.patch
Description: application/download

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Increasing timeout of poll_query_until for TAP tests

Reply via email to