[HACKERS] [RFC: bug fix?] Connection attempt block forever when the synchronous standby is not running

MauMau Fri, 04 Jul 2014 06:56:32 -0700

Hello,

My customer reported a strange connection hang problem. He and I couldn'treproduce it. I haven't been able to understand the cause, but I can thinkof one hypothesis. Could you give me your opinions on whether my hypothesisis correct, and a direction on how to fix the problem? I'm willing tosubmit a patch if necessary.



[Problem]

The customer is using synchronous streaming replication with PostgreSQL9.2.8. The cluster consists of two nodes.


He performed archive recovery test like this:

1. Take a base backup. At that time, some notable settings inpostgresql.conf are:

synchronous_standby_names = 'node2'
autovacuum = on
# synchronous_commit is commented out, so it's on by default

2. Some update operations.  I don't know what.

3. Shutdown the primary and promote the standby.

4. Shutdown the new primary.

5. Perform archive recovery. That is, restore the base backup, createrecovery.conf, and do pg_ctl start.

6. Immediately after the archive recovery is complete, connect to thedatabase server and perform some queries to check user data.


The steps 5 and 6 are done in some recovery script.

However, the connection attempt in step 6 got stuck for 12 hours, and thetest was canceled. The stack trace was:


#0  0x0000003f4badf258 in poll () from /lib64/libc.so.6
#1  0x0000000000619b94 in WaitLatchOrSocket ()
#2  0x0000000000640c4c in SyncRepWaitForLSN ()
#3  0x0000000000491c18 in RecordTransactionCommit ()
#4  0x0000000000491d98 in CommitTransaction ()
#5  0x0000000000493135 in CommitTransactionCommand ()
#6  0x000000000074938a in InitPostgres ()
#7  0x000000000066ddd7 in PostgresMain ()
#8  0x0000000000627d81 in PostmasterMain ()
#9  0x00000000005c4803 in main ()

The connection attempt is waiting for a reply from the standby. This isstrange, because we didn't anticipate that the connection establishment (andsubsequent SELECT queries) would update something and write some WAL. Thedoc says:


http://www.postgresql.org/docs/current/static/warm-standby.html#SYNCHRONOUS-REPLICATION

"When requesting synchronous replication, each commit of a write transactionwill wait until confirmation is received that the commit has been written tothe transaction log on disk of both the primary and standby server.

...

Read only transactions and transaction rollbacks need not wait for repliesfrom standby servers. Subtransaction commits do not wait for responses fromstandby servers, only top-level commits."



[Hypothesis]
Why does the connection processing emit WAL?

Probably, it did page-at-a-time vacuum during access to pg_database andpg_authid for client authentication. src/backend/access/heap/README.HOTdescribes:


"Effectively, space reclamation happens during tuple retrieval when the
page is nearly full (<10% free) and a buffer cleanup lock can be
acquired.  This means that UPDATE, DELETE, and SELECT can trigger space
reclamation, but often not during INSERT ... VALUES because it does
not retrieve a row."

But the customer could not reproduce the problem when he performed the samearchive recovery from the same base backup again. Why? I guess theautovacuum daemon vacuumed the system catalogs before he attempted toconnect to the database.


Is this correct?


[How to fix]

Of course, adding "-o '-c synchronous_commit=local'" or "-o '-csynchronous_standby_names='" to pg_ctl start in the recovery script wouldprevent the problem.

But isn't there anything to fix in PostgreSQL? I think the doc needsimprovement so that users won't misunderstand that only write transactionswould block at commit.

Do you think something else should be done? I guess pg_basebackup,pg_isready, and PQping() called in pg_ctl -w start/restart would blocklikewise, and I'm afraid users don't anticipate it. pg_upgrade appears toset synchronous_commit to local when starting the database server.


Regards
MauMau





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] [RFC: bug fix?] Connection attempt block forever when the synchronous standby is not running

Reply via email to