Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

Ivan Kartyshov Wed, 28 Feb 2018 06:26:09 -0800

Thank you for your valuable comments. I've made a few adjustments.

The main goal of my changes is to let long read-only transactions run on the replica when hot_standby_feedback is turned on.

Patch1 - hsfeedback_av_truncate.patch stops ResolveRecoveryConflictWithLock from occurring on the replica after autovacuum lazily truncates the heap on the master, cutting some pages off the end. When hot_standby_feedback is on, we know that autovacuum does not remove anything that could still be needed on the standby, so there is no need to raise any ResolveRecoveryConflict*.

1) Add an isautovacuum flag to xl_standby_locks and xl_smgr_truncate, which tells us that autovacuum generated them.

2) When autovacuum decides to trim the table (using lazy_truncate_heap), it takes AccessExclusiveLock and sends this lock to the replica, but the replica should ignore AccessExclusiveLock if hot_standby_feedback=on.

3) When the autovacuum truncate WAL message is replayed on a replica, it takes ExclusiveLock on the table, so as not to interfere with read-only requests.
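The replay-side decisions in steps 2 and 3 can be sketched as two small pure functions. This is only an illustrative model, not code from the patch: the `Sketch*` enum and both function names are made up for this example, and no real PostgreSQL locking calls are involved.

```c
#include <stdbool.h>

/* Simplified stand-ins for the lock modes involved; not PostgreSQL's LOCKMODE. */
typedef enum
{
    SKETCH_NO_LOCK,
    SKETCH_EXCLUSIVE_LOCK,
    SKETCH_ACCESS_EXCLUSIVE_LOCK
} SketchLockMode;

/*
 * Step 2: lock taken on the replica when a standby-lock WAL record is
 * replayed. Locks logged by autovacuum are skipped when feedback is on.
 */
static SketchLockMode
sketch_replay_standby_lock(bool isautovacuum, bool hot_standby_feedback)
{
    if (isautovacuum && hot_standby_feedback)
        return SKETCH_NO_LOCK;
    return SKETCH_ACCESS_EXCLUSIVE_LOCK;
}

/*
 * Step 3: extra lock taken by truncate replay itself. Because the
 * AccessExclusiveLock was skipped, an autovacuum truncate takes the weaker
 * ExclusiveLock around the truncation while in hot standby.
 */
static SketchLockMode
sketch_replay_truncate_lock(bool isautovacuum, bool in_hot_standby)
{
    if (isautovacuum && in_hot_standby)
        return SKETCH_EXCLUSIVE_LOCK;
    return SKETCH_NO_LOCK;
}
```

Note that the attached diff tests only the isautovacuum flag in standby_redo; the hot_standby_feedback condition in the sketch follows the description in step 2.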

ResolveRecoveryConflictWithLock resolves the conflict in one of two ways once the timers (max_standby_streaming_delay and max_standby_archive_delay) have run out:

- the backend is idle in transaction (waiting for input): in this case the backend will be sent SIGTERM
- the backend's transaction is running a query: in this case the running transaction will be aborted
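As a rough model of the two outcomes above, the function below maps a backend's state and the elapsed wait to an action. All names here are hypothetical and chosen for the example; the real logic lives in PostgreSQL's recovery conflict code and is considerably more involved.

```c
#include <stdbool.h>

/* Hypothetical backend states and actions, for illustration only. */
typedef enum
{
    SKETCH_IDLE_IN_TRANSACTION,
    SKETCH_RUNNING_QUERY
} SketchBackendState;

typedef enum
{
    SKETCH_KEEP_WAITING,
    SKETCH_TERMINATE_BACKEND,   /* backend receives SIGTERM */
    SKETCH_ABORT_TRANSACTION    /* running transaction is cancelled */
} SketchAction;

/*
 * Decide what happens to a conflicting backend. max_delay_ms models
 * max_standby_streaming_delay or max_standby_archive_delay; a negative
 * value means "wait forever", as with the real settings set to -1.
 */
static SketchAction
sketch_resolve_lock_conflict(SketchBackendState state,
                             long waited_ms, long max_delay_ms)
{
    if (max_delay_ms < 0 || waited_ms < max_delay_ms)
        return SKETCH_KEEP_WAITING;

    if (state == SKETCH_IDLE_IN_TRANSACTION)
        return SKETCH_TERMINATE_BACKEND;
    return SKETCH_ABORT_TRANSACTION;
}
```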


How to test:

Make an async replica, turn on feedback, and reduce max_standby_streaming_delay.

Make autovacuum more aggressive.
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test1:

Here we put a load on the master and simulate a long transaction with repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a 1-second duration every 6 seconds).

MASTER        REPLICA
    hot_standby = on
    max_standby_streaming_delay = 1s
    hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
    start
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT pg_sleep(value) FROM test;
    \watch 6

---Autovacuum truncate pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, lazy_vacuum_truncation passed without fatal errors.

Only occasionally does an error occur, because this test is too synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
This happens because ResolveRecoveryConflictWithSnapshot is raised while redoing some visibility flags. To avoid this conflict we can run Test2 or increase max_standby_streaming_delay.


Test2:

Here we put a load on the master and simulate a long transaction on the replica (by taking a LOCK on the table).

MASTER        REPLICA
    hot_standby = on
    max_standby_streaming_delay = 1s
    hot_standby_feedback = on
start

CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1) id);

pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
    start
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    LOCK TABLE test IN ACCESS SHARE MODE;
    select * from test;
    \watch 6

---Autovacuum truncate pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, lazy_vacuum_truncation passed without fatal errors.

Test3:

Here we put a load on the master and simulate a long transaction with repeated 1-second SEQSCANs on the replica (by calling pg_sleep with a 1-second duration every 6 seconds).

MASTER        REPLICA
    hot_standby = on
    max_standby_streaming_delay = 4s
    hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 200 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
    start
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT pg_sleep(value) FROM test;

---Autovacuum truncate pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.

On the patched version, lazy_vacuum_truncation passed without fatal errors.

This way we can run transactions with SEQSCAN, INDEXSCAN or BITMAPSCAN.


Patch2 - hsfeedback_noninvalide_xmin.patch

When walsender is initialized, its xmin in PROCARRAY is set to GetOldestXmin() in order to prevent autovacuum running on the master from truncating the relation and removing pages that are required by the replica. This might happen if the master's autovacuum and the replica's query started simultaneously and the replica has not yet reported its xmin value.
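To illustrate why an unset walsender xmin matters, here is a toy model of a cleanup horizon computed over a list of advertised xmins. It is an assumption-laden sketch: `SketchXid`, `SKETCH_INVALID_XID` and `sketch_oldest_xmin` are invented for this example, and it ignores XID wraparound and everything else the real GetOldestXmin() takes into account.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;

/* Marks a slot that advertises no xmin (hypothetical, for this sketch). */
#define SKETCH_INVALID_XID ((SketchXid) 0)

/*
 * Return the oldest xmin advertised by any slot, or next_xid if no slot
 * advertises one. Tuples older than the returned value may be cleaned up.
 * Wraparound is deliberately ignored to keep the model small.
 */
static SketchXid
sketch_oldest_xmin(const SketchXid *xmins, size_t nslots, SketchXid next_xid)
{
    SketchXid oldest = next_xid;

    for (size_t i = 0; i < nslots; i++)
    {
        if (xmins[i] != SKETCH_INVALID_XID && xmins[i] < oldest)
            oldest = xmins[i];
    }
    return oldest;
}
```

In this model a walsender slot left at SKETCH_INVALID_XID does not hold the horizon back at all, while a slot initialized to the horizon at connection time keeps it in place until the replica's first feedback message arrives.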


How to test:

Make an async replica, turn on feedback, reduce max_standby_streaming_delay, and make autovacuum aggressive.

autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0

Test:

Here we will start the replica and begin a repeatable read transaction on the table, then we stop the replica's postmaster to prevent the walreceiver worker from starting (on master startup) and sending the master its transaction xmin in a hot_standby_feedback message.

MASTER        REPLICA
start

CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,10000000) id);

stop
    start
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT * FROM test;
    stop postmaster with gdb
start
DELETE FROM test WHERE id > 0;
wait till autovacuum deletes and changes xmin
            release postmaster with gdb
--- Result on replica
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

There is one feature of standby behavior that lets us allow autovacuum to cut off pages of the table (at the end of the relation) that no one else needs (because they contain only dead and removed tuples). If a standby SEQSCAN or another *SCAN mdreads a page that is damaged or has been deleted, it will receive a zero page, and the request will not fail with an ERROR.
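The "zero page instead of an error" behavior described above can be modelled with plain stdio. This is a sketch of the idea only, not of mdread itself: `sketch_read_block_or_zero` and `SKETCH_BLCKSZ` are hypothetical names, and the block size of 8192 is assumed.

```c
#include <stdio.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192      /* assumed block size for this sketch */

/*
 * Read block blkno from fp into page (SKETCH_BLCKSZ bytes). If the block
 * lies past the end of the file, or is only partially present, the missing
 * part is zero-filled instead of reporting an error.
 * Returns the number of bytes actually read from the file.
 */
static size_t
sketch_read_block_or_zero(FILE *fp, long blkno, unsigned char *page)
{
    size_t nread = 0;

    if (fseek(fp, blkno * SKETCH_BLCKSZ, SEEK_SET) == 0)
        nread = fread(page, 1, SKETCH_BLCKSZ, fp);

    if (nread < SKETCH_BLCKSZ)
        memset(page + nread, 0, SKETCH_BLCKSZ - nread);

    return nread;
}
```

A scan built on such a read sees an all-zero page for a truncated block and can treat it as empty rather than failing.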


Could you give me your thoughts on these patches?

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cff49ba..8e6c525 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -27,8 +27,10 @@
 #include "catalog/catalog.h"
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
+#include "postmaster/autovacuum.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "storage/lock.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -269,6 +271,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 		xlrec.blkno = nblocks;
 		xlrec.rnode = rel->rd_node;
 		xlrec.flags = SMGR_TRUNCATE_ALL;
+		xlrec.isautovacuum = IsAutoVacuumWorkerProcess();
 
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -495,6 +498,16 @@ smgr_redo(XLogReaderState *record)
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
 		SMgrRelation reln;
 		Relation	rel;
+		bool		isautovacuum = false;
+
+		/*
+		 * If truncation was made by autovacuum, then take Exclusive lock
+		 * because previously AccessExclusive lock was blocked from master to
+		 * let long transactions run on replica.
+		 * NB: do it only InHotStandby
+		 */
+		if (InHotStandby)
+			isautovacuum = xlrec->isautovacuum;
 
 		reln = smgropen(xlrec->rnode, InvalidBackendId);
 
@@ -525,10 +538,29 @@ smgr_redo(XLogReaderState *record)
 
 		if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
 		{
+			LOCKTAG		locktag;
+
+			/*
+			 * If the value isautovacuum is true, then we assume that truncate
+			 * wal was formed by the autovacuum and we ourselves have to take
+			 * ExclusiveLock on the relation, because we didn't apply
+			 * AccessExclusiveLock from master to let long transactions work
+			 * on replica.
+			 */
+			if (isautovacuum)
+			{
+				/* Behave like LockRelationForExtension */
+				SET_LOCKTAG_RELATION_EXTEND(locktag, xlrec->rnode.dbNode, xlrec->rnode.relNode);
+				(void) LockAcquire(&locktag, ExclusiveLock, false, false);
+			}
+
 			smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
 
 			/* Also tell xlogutils.c about it */
 			XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
+
+			if (isautovacuum)
+				LockRelease(&locktag, ExclusiveLock, true);
 		}
 
 		/* Truncate FSM and VM too */
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 44ed209..34fbd30 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
 #include "access/xloginsert.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "postmaster/autovacuum.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
@@ -37,6 +38,7 @@
 int			vacuum_defer_cleanup_age;
 int			max_standby_archive_delay = 30 * 1000;
 int			max_standby_streaming_delay = 30 * 1000;
+extern bool hot_standby_feedback;
 
 static List *RecoveryLockList;
 
@@ -805,10 +807,17 @@ standby_redo(XLogReaderState *record)
 		xl_standby_locks *xlrec = (xl_standby_locks *) XLogRecGetData(record);
 		int			i;
 
-		for (i = 0; i < xlrec->nlocks; i++)
-			StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
-											  xlrec->locks[i].dbOid,
-											  xlrec->locks[i].relOid);
+		/*
+		 * If this xlog standby lock was formed by autovacuum, then ignore it
+		 * because this can cause a lock conflict with a long transaction
+		 * running on the replica and kill transaction or its backend.
+		 * It is important on hot standbys with hot_standby_feedback = on
+		 */
+		if (!xlrec->isautovacuum)
+			for (i = 0; i < xlrec->nlocks; i++)
+				StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
+												  xlrec->locks[i].dbOid,
+												  xlrec->locks[i].relOid);
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
@@ -1031,6 +1040,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
 	xl_standby_locks xlrec;
 
 	xlrec.nlocks = nlocks;
+	xlrec.isautovacuum = IsAutoVacuumWorkerProcess();
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 5738071..049de955 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -48,6 +48,7 @@ typedef struct xl_smgr_truncate
 	BlockNumber blkno;
 	RelFileNode rnode;
 	int			flags;
+	bool		isautovacuum;	/* mark that autovacuum called xl_smgr_truncate */
 } xl_smgr_truncate;
 
 extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index bb61448..dadceb3 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -38,6 +38,7 @@ extern void standby_desc_invalidations(StringInfo buf,
 typedef struct xl_standby_locks
 {
 	int			nlocks;			/* number of entries in locks array */
+	bool		isautovacuum;	/* mark that autovacuum called xl_standby_locks */
 	xl_standby_lock locks[FLEXIBLE_ARRAY_MEMBER];
 } xl_standby_locks;

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d46374d..ed756be 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -279,6 +279,14 @@ InitWalSender(void)
 
 	/* Initialize empty timestamp buffer for lag tracking. */
 	memset(&LagTracker, 0, sizeof(LagTracker));
+
+	/*
+	 * Initialize walsender's xmin for hot_standby_feedback corner case when
+	 * autovacuum calls GetOldestXmin and truncates tuples that replica needs, but has not
+	 * yet informed the master about, because it started its transaction at the same time as autovacuum.
+	 * If hot_standby_feedback is off walsender will send at least one feedback message.
+	 */
+	MyPgXact->xmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
 }
 
 /*
