Thank you for your valuable comments. I've made a few adjustments.
The main goal of my changes is to let long read-only transactions run on
a replica when hot_standby_feedback is turned on.
Patch1 - hsfeedback_av_truncate.patch stops
ResolveRecoveryConflictWithLock from occurring on the replica after
autovacuum lazily truncates the heap on the master, cutting some pages
off the end. When hot_standby_feedback is on, we know that autovacuum
does not remove anything that could still be needed on the standby, so
there is no need to raise any ResolveRecoveryConflict*.
1) Add an isautovacuum flag to xl_standby_locks and xl_smgr_truncate,
which tells us that autovacuum generated them.
2) When autovacuum decides to trim the table (using lazy_truncate_heap),
it takes AccessExclusiveLock and sends this lock to the replica, but the
replica should ignore AccessExclusiveLock if hot_standby_feedback=on.
3) When the autovacuum truncate WAL record is replayed on a replica, it
takes ExclusiveLock on the table, so as not to interfere with read-only
requests.
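To make points 2) and 3) concrete, here is a minimal self-contained C model of the replay-side decision. It is only a sketch: ReplayLockAction, standby_lock_action and truncate_needs_own_lock are hypothetical names used for illustration, not functions from the patch or from PostgreSQL, and the hot_standby_feedback check follows the description in point 2).

```c
#include <stdbool.h>

/* Hypothetical model of what replay does with a standby lock record. */
typedef enum
{
	REPLAY_ACQUIRE_ACCESS_EXCLUSIVE,	/* usual behaviour, may conflict with queries */
	REPLAY_IGNORE_LOCK					/* skip the lock, no lock conflict is raised */
} ReplayLockAction;

/*
 * Decide how to replay an xl_standby_locks record.  isautovacuum stands for
 * the flag added to the record; feedback_on stands for hot_standby_feedback
 * on the replica.
 */
ReplayLockAction
standby_lock_action(bool isautovacuum, bool feedback_on)
{
	if (isautovacuum && feedback_on)
		return REPLAY_IGNORE_LOCK;
	return REPLAY_ACQUIRE_ACCESS_EXCLUSIVE;
}

/*
 * If the AccessExclusiveLock was ignored, replay of the truncate record has
 * to take its own, weaker lock around the truncation.
 */
bool
truncate_needs_own_lock(bool isautovacuum, bool feedback_on)
{
	return standby_lock_action(isautovacuum, feedback_on) == REPLAY_IGNORE_LOCK;
}
```

In this model only the combination "record generated by autovacuum" plus "feedback on" skips the AccessExclusiveLock; every other record is replayed as before.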
Once the timers (max_standby_streaming_delay and
max_standby_archive_delay) have run out, ResolveRecoveryConflictWithLock
is resolved in one of two ways:
- the backend is idle in transaction (waiting for input): the backend is
  sent SIGTERM
- the backend is running a query: the running transaction is aborted
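The two cases above can be written down as a tiny decision function. This is a minimal sketch of the behaviour as described here, not code from PostgreSQL: BackendState, ConflictAction and lock_conflict_action are hypothetical names, and the timer is modelled as a plain millisecond comparison.

```c
/* Hypothetical states of the backend that holds the conflicting lock. */
typedef enum
{
	BACKEND_IDLE_IN_TRANSACTION,	/* waiting for input inside a transaction */
	BACKEND_RUNNING_QUERY			/* currently executing a query */
} BackendState;

/* Hypothetical outcomes of the lock conflict. */
typedef enum
{
	CONFLICT_KEEP_WAITING,			/* timer has not run out yet */
	CONFLICT_TERMINATE_BACKEND,		/* connection is terminated (FATAL) */
	CONFLICT_ABORT_TRANSACTION		/* running transaction is aborted (ERROR) */
} ConflictAction;

/*
 * waited_ms is how long replay has been blocked; max_delay_ms stands for
 * max_standby_streaming_delay or max_standby_archive_delay.
 */
ConflictAction
lock_conflict_action(BackendState state, long waited_ms, long max_delay_ms)
{
	if (waited_ms < max_delay_ms)
		return CONFLICT_KEEP_WAITING;
	if (state == BACKEND_IDLE_IN_TRANSACTION)
		return CONFLICT_TERMINATE_BACKEND;
	return CONFLICT_ABORT_TRANSACTION;
}
```

With max_standby_streaming_delay = 1s, as in the tests in this mail, a session that sits idle in a transaction for longer than a second falls into the first case, which matches the FATAL message shown in the results.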
How to test:
Set up an async replica, turn on hot_standby_feedback and reduce
max_standby_streaming_delay.
Make autovacuum more aggressive:
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0
Test1:
Here we put a load on the master and simulate a long transaction with
repeated 1-second SEQSCANs on the replica (by calling pg_sleep for
1 second every 6 seconds).
MASTER                                  REPLICA
                                        hot_standby = on
                                        max_standby_streaming_delay = 1s
                                        hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
                                        start
                                        BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
                                        SELECT pg_sleep(value) FROM test;
                                        \watch 6
---Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.
On the patched version, lazy_vacuum_truncation passes without fatal errors.
Only occasionally does an error occur, because this test is too synthetic:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
This happens because ResolveRecoveryConflictWithSnapshot is raised while
replaying changes to some visibility flags. To avoid this conflict we
can run Test2 or increase max_standby_streaming_delay.
Test2:
Here we put a load on the master and simulate a long transaction on the
replica (by taking a LOCK on the table).
MASTER                                  REPLICA
                                        hot_standby = on
                                        max_standby_streaming_delay = 1s
                                        hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM generate_series(1,1)
id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
                                        start
                                        BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
                                        LOCK TABLE test IN ACCESS SHARE MODE;
                                        select * from test;
                                        \watch 6
---Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.
On the patched version, lazy_vacuum_truncation passes without fatal errors.
Test3:
Here we put a load on the master and simulate a long transaction with a
single long-running SEQSCAN on the replica (pg_sleep is called once with
value = 200, i.e. it sleeps for 200 seconds).
MASTER                                  REPLICA
                                        hot_standby = on
                                        max_standby_streaming_delay = 4s
                                        hot_standby_feedback = on
start
CREATE TABLE test AS (SELECT id, 200 AS value
FROM generate_series(1,1) id);
pgbench -T600 -P2 -n --file=master.sql postgres
(update test set value = value;)
                                        start
                                        BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
                                        SELECT pg_sleep(value) FROM test;
---Autovacuum truncates pages at the end
Result on replica:
FATAL: terminating connection due to conflict with recovery
DETAIL: User was holding a relation lock for too long.
On the patched version, lazy_vacuum_truncation passes without fatal errors.
This way we can test transactions that use SEQSCAN, INDEXSCAN or
BITMAPSCAN.
Patch2 - hsfeedback_noninvalide_xmin.patch
When walsender is initialized, its xmin in PROCARRAY is set to
GetOldestXmin() in order to prevent autovacuum running on the master
from truncating a relation and removing pages that are required by the
replica. This might happen if the master's autovacuum and the replica's
query started simultaneously and the replica has not yet reported its
xmin value.
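The effect of initializing walsender's xmin can be illustrated with a toy model of the xmin horizon. This is a simplified sketch under loud assumptions: Xid, oldest_xmin and removable_by_vacuum are hypothetical names, transaction ids are compared as plain integers (wraparound is ignored), and the real GetOldestXmin() considers more than a flat array of xmin values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;			/* toy transaction id, no wraparound handling */
#define INVALID_XID ((Xid) 0)	/* slot advertises no xmin */

/*
 * Oldest xmin advertised by any slot of the toy proc array.  Slots with
 * INVALID_XID are skipped; if nobody advertises an xmin, the horizon is
 * next_xid.
 */
Xid
oldest_xmin(const Xid *proc_xmins, size_t nprocs, Xid next_xid)
{
	Xid		result = next_xid;

	for (size_t i = 0; i < nprocs; i++)
	{
		if (proc_xmins[i] != INVALID_XID && proc_xmins[i] < result)
			result = proc_xmins[i];
	}
	return result;
}

/* A tuple deleted by deleter_xid may be removed only below the horizon. */
bool
removable_by_vacuum(Xid deleter_xid, Xid horizon)
{
	return deleter_xid < horizon;
}
```

For example, if walsender's slot is still INVALID_XID and next_xid is 200, a tuple deleted by transaction 150 is removable. If walsender's slot was initialized to 100 at startup, the horizon stays at 100 and the same tuple is kept until the replica reports its own xmin.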
How to test:
Set up an async replica, turn on hot_standby_feedback, reduce
max_standby_streaming_delay and make autovacuum more aggressive:
autovacuum = on
autovacuum_max_workers = 1
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_cost_delay = 0
Test:
Here we start the replica and begin a repeatable read transaction on the
table, then we stop the replica's postmaster to prevent it from starting
the walreceiver worker (on master startup) and sending the master its
transaction xmin in a hot_standby_feedback message.
MASTER                                  REPLICA
start
CREATE TABLE test AS (SELECT id, 1 AS value FROM
generate_series(1,10000000) id);
stop
                                        start
                                        BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
                                        SELECT * FROM test;
                                        stop postmaster with gdb
start
DELETE FROM test WHERE id > 0;
wait until autovacuum has deleted the rows and xmin has changed
                                        release postmaster with gdb
--- Result on replica
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
One feature of the standby's behavior lets us allow autovacuum to cut
off pages at the end of a relation that no one else needs (because they
contain only dead and removed tuples): if a SEQSCAN or another *SCAN on
the standby calls mdread on a page that is damaged or has been deleted,
it receives a zeroed page instead of failing the request with an ERROR.
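The zero-page behavior can be sketched with a small in-memory model. This is only an illustration of the behavior described above, not the real mdread: ToyRelation, toy_read_block and TOY_BLCKSZ are hypothetical names, and a false return value stands in for raising an ERROR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 8192			/* toy page size */

/* Toy relation: a flat buffer holding nblocks pages. */
typedef struct
{
	const unsigned char *data;
	size_t		nblocks;
} ToyRelation;

/*
 * Read block blkno into buf.  A block past the end of the relation (for
 * example one cut off by a truncate) is returned as a zeroed page when
 * in_recovery is true; otherwise the read fails.
 */
bool
toy_read_block(const ToyRelation *rel, size_t blkno, bool in_recovery,
			   unsigned char *buf)
{
	if (blkno < rel->nblocks)
	{
		memcpy(buf, rel->data + blkno * TOY_BLCKSZ, TOY_BLCKSZ);
		return true;
	}
	if (in_recovery)
	{
		memset(buf, 0, TOY_BLCKSZ);		/* scan sees an empty page */
		return true;
	}
	return false;						/* stands in for ERROR */
}
```

A scan on the standby that reaches a truncated block therefore sees an empty page with no tuples on it, which is harmless as long as the removed pages contained only dead tuples.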
Could you share your thoughts on these patches?
--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cff49ba..8e6c525 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -27,8 +27,10 @@
#include "catalog/catalog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "postmaster/autovacuum.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
+#include "storage/lock.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -269,6 +271,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
xlrec.blkno = nblocks;
xlrec.rnode = rel->rd_node;
xlrec.flags = SMGR_TRUNCATE_ALL;
+ xlrec.isautovacuum = IsAutoVacuumWorkerProcess();
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -495,6 +498,16 @@ smgr_redo(XLogReaderState *record)
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
SMgrRelation reln;
Relation rel;
+ bool isautovacuum = false;
+
+ /*
+ * If the truncation was made by autovacuum, take ExclusiveLock here,
+ * because the AccessExclusiveLock sent from the master was ignored to
+ * let long transactions run on the replica.
+ * NB: do it only InHotStandby
+ */
+ if (InHotStandby)
+ isautovacuum = xlrec->isautovacuum;
reln = smgropen(xlrec->rnode, InvalidBackendId);
@@ -525,10 +538,29 @@ smgr_redo(XLogReaderState *record)
if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
{
+ LOCKTAG locktag;
+
+ /*
+ * If isautovacuum is true, then we assume that the truncate WAL
+ * record was generated by autovacuum and we have to take
+ * ExclusiveLock on the relation ourselves, because we didn't apply
+ * the AccessExclusiveLock from the master, to let long transactions
+ * run on the replica.
+ */
+ if (isautovacuum)
+ {
+ /* Behave like LockRelationForExtension */
+ SET_LOCKTAG_RELATION_EXTEND(locktag, xlrec->rnode.dbNode, xlrec->rnode.relNode);
+ (void) LockAcquire(&locktag, ExclusiveLock, false, false);
+ }
+
smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
/* Also tell xlogutils.c about it */
XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno);
+
+ if (isautovacuum)
+ LockRelease(&locktag, ExclusiveLock, true);
}
/* Truncate FSM and VM too */
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 44ed209..34fbd30 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -23,6 +23,7 @@
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "postmaster/autovacuum.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
@@ -37,6 +38,7 @@
int vacuum_defer_cleanup_age;
int max_standby_archive_delay = 30 * 1000;
int max_standby_streaming_delay = 30 * 1000;
+extern bool hot_standby_feedback;
static List *RecoveryLockList;
@@ -805,10 +807,17 @@ standby_redo(XLogReaderState *record)
xl_standby_locks *xlrec = (xl_standby_locks *) XLogRecGetData(record);
int i;
- for (i = 0; i < xlrec->nlocks; i++)
- StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
- xlrec->locks[i].dbOid,
- xlrec->locks[i].relOid);
+ /*
+ * If this xlog standby lock was generated by autovacuum, then ignore it,
+ * because it can cause a lock conflict with a long transaction
+ * running on the replica and kill the transaction or its backend.
+ * This is important on hot standbys with hot_standby_feedback = on.
+ */
+ if (!(xlrec->isautovacuum && hot_standby_feedback))
+ for (i = 0; i < xlrec->nlocks; i++)
+ StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
+ xlrec->locks[i].dbOid,
+ xlrec->locks[i].relOid);
}
else if (info == XLOG_RUNNING_XACTS)
{
@@ -1031,6 +1040,7 @@ LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
xl_standby_locks xlrec;
xlrec.nlocks = nlocks;
+ xlrec.isautovacuum = IsAutoVacuumWorkerProcess();
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 5738071..049de955 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -48,6 +48,7 @@ typedef struct xl_smgr_truncate
BlockNumber blkno;
RelFileNode rnode;
int flags;
+ bool isautovacuum; /* mark that autovacuum called xl_smgr_truncate */
} xl_smgr_truncate;
extern void log_smgrcreate(RelFileNode *rnode, ForkNumber forkNum);
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index bb61448..dadceb3 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -38,6 +38,7 @@ extern void standby_desc_invalidations(StringInfo buf,
typedef struct xl_standby_locks
{
int nlocks; /* number of entries in locks array */
+ bool isautovacuum; /* mark that autovacuum called xl_standby_locks */
xl_standby_lock locks[FLEXIBLE_ARRAY_MEMBER];
} xl_standby_locks;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d46374d..ed756be 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -279,6 +279,14 @@ InitWalSender(void)
/* Initialize empty timestamp buffer for lag tracking. */
memset(&LagTracker, 0, sizeof(LagTracker));
+
+ /*
+ * Initialize walsender's xmin for the hot_standby_feedback corner case where
+ * autovacuum calls GetOldestXmin and removes tuples that the replica needs but
+ * has not yet reported to the master, as its transaction started concurrently.
+ * If hot_standby_feedback is off walsender will send at least one feedback message.
+ */
+ MyPgXact->xmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
}
/*