Re: [HACKERS] snapbuild woes

Petr Jelinek Tue, 21 Feb 2017 18:06:23 -0800

On 13/12/16 00:38, Petr Jelinek wrote:
> On 12/12/16 23:33, Andres Freund wrote:
>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
>>> On 12/12/16 22:42, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>>>>> Hi,
>>>>> First one is outright bug, which has to do with how we track running
>>>>> transactions. What snapbuild basically does while doing initial snapshot
>>>>> is read the xl_running_xacts record, store the list of running txes and
>>>>> then wait until they all finish. The problem with this is that
>>>>> xl_running_xacts does not ensure that it only logs transactions that are
>>>>> actually still running (to avoid locking PGPROC) so there might be xids
>>>>> in xl_running_xacts that already committed before it was logged.
>>>>
>>>> I don't think that's actually true?  Notice how LogStandbySnapshot()
>>>> only releases the lock *after* the LogCurrentRunningXacts() iff
>>>> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
>>>> observed must actually be a bit more complex :(
>>>>
>>>
>>> Hmm, interesting, I did see the transaction commit in the WAL before the
>>> xl_running_xacts that contained the xid as running. I only seen it on
>>> production system though, didn't really manage to easily reproduce it
>>> locally.
>>
>> I suspect the reason for that is that RecordTransactionCommit() doesn't
>> conflict with ProcArrayLock in the first place - only
>> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
>> sense, just not the crash-recovery sense...
>>
> 
> That looks like reasonable explanation. BTW I realized my patch needs
> bit more work, currently it will break the actual snapshot as it behaves
> same as if the xl_running_xacts was empty which is not correct AFAICS.
>


Hi,

I got to work on this again. Unfortunately I haven't found solution that
I would be very happy with. What I did is in case we read
xl_running_xacts which has all transactions we track finished, we start
tracking from that new xl_running_xacts again with the difference that
we clean up the running transactions based on previously seen committed
ones. That means that on busy server we may wait for multiple
xl_running_xacts rather than just one, but at least we have chance to
finish unlike with current coding which basically waits for empty
xl_running_xacts. I also removed the additional locking for logical
wal_level in LogStandbySnapshot() since it does not work.

I also identified another bug in snapbuild while looking at the code.
That is the logical decoding will try to use on disk serialized snapshot
for initial snapshot export when it can. The problem is that these
snapshots are quite special and are not really usable as snapshots for
data (ie, the logical decoding snapshots regularly have xmax smaller
than xmin). So then when client tries to use this exported snapshot it
gets completely wrong data as the snapshot is broken. I think this is
explanation for Erik Rijker's problems with the initial COPY patch for
logical replication. At least for me the issues go away when I disable
use of the ondisk snapshots.

I didn't really find better solution than that though (disabling the use
of ondisk snapshots for initial consistent snapshot).

So to summarize attached patches:
0001 - Fixes performance issue where we build tons of snapshots that we
don't need which kills CPU.

0002 - Disables the use of ondisk historical snapshots for initial
consistent snapshot export as it may result in corrupt data. This
definitely needs backport.

0003 - Fixes bug where we might never reach snapshot on busy server due
to race condition in xl_running_xacts logging. The original use of extra
locking does not seem to be enough in practice. Once we have agreed fix
for this it's probably worth backpatching. There are still some comments
that need updating, this is more of a PoC.

Thoughts or better ideas?

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

From e8d4dc52bc9dc7e5b17f4e374319fda229b19e61 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmo...@pjmodos.net>
Date: Tue, 21 Feb 2017 19:58:18 +0100
Subject: [PATCH 1/3] Skip unnecessary snapshot builds

When doing initial snapshot build during logical decoding
initialization, don't build snapshots for transactions where we know the
transaction didn't do any catalog changes. Otherwise we might end up
with thousands of useless snapshots on busy server which can be quite
expensive.
---
 src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
 1 file changed, 53 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index ed9f69f..c2476a9 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -972,6 +972,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	bool		forced_timetravel = false;
 	bool		sub_needs_timetravel = false;
 	bool		top_needs_timetravel = false;
+	bool		skip_forced_snapshot = false;
 
 	TransactionId xmax = xid;
 
@@ -993,10 +994,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/*
 		 * We could avoid treating !SnapBuildTxnIsRunning transactions as
 		 * timetravel ones, but we want to be able to export a snapshot when
-		 * we reached consistency.
+		 * we reached consistency so we need to keep track of them.
 		 */
 		forced_timetravel = true;
 		elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
+
+		/*
+		 * It is however desirable to skip building new snapshot for
+		 * !SnapBuildTxnIsRunning transactions as otherwise we might end up
+		 * building thousands of unused snapshots on busy servers which can
+		 * be very expensive.
+		 */
+		if (!SnapBuildTxnIsRunning(builder, xid))
+			skip_forced_snapshot = true;
 	}
 
 	for (nxact = 0; nxact < nsubxacts; nxact++)
@@ -1009,21 +1019,10 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		SnapBuildEndTxn(builder, lsn, subxid);
 
 		/*
-		 * If we're forcing timetravel we also need visibility information
-		 * about subtransaction, so keep track of subtransaction's state.
-		 */
-		if (forced_timetravel)
-		{
-			SnapBuildAddCommittedTxn(builder, subxid);
-			if (NormalTransactionIdFollows(subxid, xmax))
-				xmax = subxid;
-		}
-
-		/*
 		 * Add subtransaction to base snapshot if it DDL, we don't distinguish
 		 * to toplevel transactions there.
 		 */
-		else if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
+		if (ReorderBufferXidHasCatalogChanges(builder->reorder, subxid))
 		{
 			sub_needs_timetravel = true;
 
@@ -1035,6 +1034,16 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			if (NormalTransactionIdFollows(subxid, xmax))
 				xmax = subxid;
 		}
+		/*
+		 * If we're forcing timetravel we also need visibility information
+		 * about subtransaction, so keep track of subtransaction's state.
+		 */
+		else if (forced_timetravel)
+		{
+			SnapBuildAddCommittedTxn(builder, subxid);
+			if (NormalTransactionIdFollows(subxid, xmax))
+				xmax = subxid;
+		}
 	}
 
 	/*
@@ -1043,14 +1052,8 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 	 */
 	SnapBuildEndTxn(builder, lsn, xid);
 
-	if (forced_timetravel)
-	{
-		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
-
-		SnapBuildAddCommittedTxn(builder, xid);
-	}
 	/* add toplevel transaction to base snapshot */
-	else if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
+	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
 	{
 		elog(DEBUG2, "found top level transaction %u, with catalog changes!",
 			 xid);
@@ -1063,10 +1066,18 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 		/* mark toplevel txn as timetravel as well */
 		SnapBuildAddCommittedTxn(builder, xid);
 	}
+	else if (forced_timetravel)
+	{
+		elog(DEBUG2, "forced transaction %u to do timetravel.", xid);
+
+		SnapBuildAddCommittedTxn(builder, xid);
+	}
 
 	/* if there's any reason to build a historic snapshot, do so now */
 	if (forced_timetravel || top_needs_timetravel || sub_needs_timetravel)
 	{
+		bool build_snapshot;
+
 		/*
 		 * Adjust xmax of the snapshot builder, we only do that for committed,
 		 * catalog modifying, transactions, everything else isn't interesting
@@ -1087,14 +1098,29 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 			return;
 
 		/*
-		 * Decrease the snapshot builder's refcount of the old snapshot, note
-		 * that it still will be used if it has been handed out to the
-		 * reorderbuffer earlier.
+		 * Build snapshot if needed. We need to build it if there isn't one
+		 * already built, or if the transaction has made catalog changes or
+		 * when we can't know if transaction made catalog changes.
 		 */
-		if (builder->snapshot)
+		build_snapshot = !builder->snapshot || top_needs_timetravel ||
+			sub_needs_timetravel || !skip_forced_snapshot;
+
+		/*
+		 * Decrease the snapshot builder's refcount of the old snapshot if we
+		 * plan to build new one, note that it still will be used if it has
+		 * been handed out to the reorderbuffer earlier.
+		 */
+		if (builder->snapshot && build_snapshot)
 			SnapBuildSnapDecRefcount(builder->snapshot);
 
-		builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+		/* Build new snapshot unless asked not to. */
+		if (build_snapshot)
+		{
+			builder->snapshot = SnapBuildBuildSnapshot(builder, xid);
+
+			/* refcount of the snapshot builder for the new snapshot */
+			SnapBuildSnapIncRefcount(builder->snapshot);
+		}
 
 		/* we might need to execute invalidations, add snapshot */
 		if (!ReorderBufferXidHasBaseSnapshot(builder->reorder, xid))
@@ -1104,11 +1130,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
 										 builder->snapshot);
 		}
 
-		/* refcount of the snapshot builder for the new snapshot */
-		SnapBuildSnapIncRefcount(builder->snapshot);
-
 		/* add a new Snapshot to all currently running transactions */
-		SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
+		if (build_snapshot)
+			SnapBuildDistributeNewCatalogSnapshot(builder, lsn);
 	}
 	else
 	{
-- 
2.7.4

From dcedfdafce82ae65656e32468d84b084e629a70b Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmo...@pjmodos.net>
Date: Tue, 21 Feb 2017 20:14:44 +0100
Subject: [PATCH 2/3] Don't use on disk snapshots for snapshot export in
 logical decoding

We store historical snapshots on disk to enable continuation of logical
decoding after restart. These snapshots were reused by the
slot initialization code when searching for consistent snapshot. However
these snapshots are only useful for catalogs and not for normal user
tables. So when we exported such snapshots for user to read data from
tables that is consistent with a specific LSN of slot creation, user
would instead read wrong data. There does not seem to be simple way to
make the logical decoding historical snapshots useful for normal tables
so don't use them for exporting at all for now.
---
 src/backend/replication/logical/snapbuild.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index c2476a9..0b10044 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1252,11 +1252,11 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 {
 	/* ---
 	 * Build catalog decoding snapshot incrementally using information about
-	 * the currently running transactions. There are several ways to do that:
+	 * the currently running transactions. There are couple ways to do that:
 	 *
 	 * a) There were no running transactions when the xl_running_xacts record
 	 *	  was inserted, jump to CONSISTENT immediately. We might find such a
-	 *	  state we were waiting for b) and c).
+	 *	  state we were waiting for b).
 	 *
 	 * b) Wait for all toplevel transactions that were running to end. We
 	 *	  simply track the number of in-progress toplevel transactions and
@@ -1269,9 +1269,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  Interestingly, in contrast to HS, this allows us not to care about
 	 *	  subtransactions - and by extension suboverflowed xl_running_xacts -
 	 *	  at all.
-	 *
-	 * c) This (in a previous run) or another decoding slot serialized a
-	 *	  snapshot to disk that we can use.
 	 * ---
 	 */
 
@@ -1326,13 +1323,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		return false;
 	}
-	/* c) valid on disk state */
-	else if (SnapBuildRestore(builder, lsn))
-	{
-		/* there won't be any state to cleanup */
-		return false;
-	}
-
 	/*
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
-- 
2.7.4

From 501e6c9842e225a90c52b8bf3f91adba03d38390 Mon Sep 17 00:00:00 2001
From: Petr Jelinek <pjmo...@pjmodos.net>
Date: Wed, 22 Feb 2017 00:57:33 +0100
Subject: [PATCH 3/3] Fix xl_running_xacts usage in snapshot builder

Due to race condition, the xl_running_xacts might contain no longer
running transactions. Previous coding tried to get around this by
additional locking but that did not work correctly for committs. Instead
try combining decoded commits and multiple xl_running_xacts to get the
consistent snapshot.
---
 src/backend/replication/logical/snapbuild.c | 79 +++++++++++++++++++++++------
 src/backend/storage/ipc/standby.c           |  7 +--
 2 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 0b10044..e5f9a33 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1262,7 +1262,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 *	  simply track the number of in-progress toplevel transactions and
 	 *	  lower it whenever one commits or aborts. When that number
 	 *	  (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
-	 *	  to CONSISTENT.
+	 *	  to CONSISTENT. Sometimes we might get xl_running_xacts which has
+	 *	  all tracked transactions as finished. We'll need to restart tracking
+	 *	  in that case and use previously collected committed transactions to
+	 *	  purge transactions mistakenly marked as running in the
+	 *	  xl_running_xacts which exist as a result of race condition in
+	 *	  LogStandbySnapshot().
 	 *	  NB: We need to search running.xip when seeing a transaction's end to
 	 *	  make sure it's a toplevel transaction and it's been one of the
 	 *	  initially running ones.
@@ -1327,11 +1332,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 	 * b) first encounter of a useable xl_running_xacts record. If we had
 	 * found one earlier we would either track running transactions (i.e.
 	 * builder->running.xcnt != 0) or be consistent (this function wouldn't
-	 * get called).
+	 * get called). However it's possible that we could not see all
+	 * transactions that were marked as running in xl_running_xacts, so if
+	 * we get new one that says all were closed but we are not consistent
+	 * yet, we need to restart the tracking while taking previously seen
+	 * transactions into account.
 	 */
-	else if (!builder->running.xcnt)
+	else if (!builder->running.xcnt ||
+			 running->oldestRunningXid > builder->running.xmax)
 	{
 		int			off;
+		int			purge_running = builder->running.xcnt > 0;
 
 		/*
 		 * We only care about toplevel xids as those are the ones we
@@ -1367,26 +1378,15 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->running.xmin = builder->running.xip[0];
 		builder->running.xmax = builder->running.xip[running->xcnt - 1];
 
+
 		/* makes comparisons cheaper later */
 		TransactionIdRetreat(builder->running.xmin);
 		TransactionIdAdvance(builder->running.xmax);
 
 		builder->state = SNAPBUILD_FULL_SNAPSHOT;
 
-		ereport(LOG,
-			(errmsg("logical decoding found initial starting point at %X/%X",
-					(uint32) (lsn >> 32), (uint32) lsn),
-			 errdetail_plural("%u transaction needs to finish.",
-							  "%u transactions need to finish.",
-							  builder->running.xcnt,
-							  (uint32) builder->running.xcnt)));
-
 		/*
-		 * Iterate through all xids, wait for them to finish.
-		 *
-		 * This isn't required for the correctness of decoding, but to allow
-		 * isolationtester to notice that we're currently waiting for
-		 * something.
+		 * Iterate through all xids and do additional checking/purging.
 		 */
 		for (off = 0; off < builder->running.xcnt; off++)
 		{
@@ -1400,9 +1400,56 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			if (TransactionIdIsCurrentTransactionId(xid))
 				elog(ERROR, "waiting for ourselves");
 
+			/*
+			 * Use gathered info about committed transactions to purge
+			 * committed transactions recorded xl_running_xacts as running
+			 * because of race condition in LogStandbySnapshot(). This may
+			 * be slow but it should be called at most once per slot
+			 * initialization.
+			 */
+			if (purge_running)
+			{
+				int i;
+
+				for (i = 0; i < builder->committed.xcnt; i++)
+				{
+					if (builder->committed.xip[i] == xid)
+					{
+						SnapBuildEndTxn(builder, lsn, xid);
+						continue;
+					}
+				}
+			}
+
+			/*
+			 * This isn't required for the correctness of decoding, but to allow
+			 * isolationtester to notice that we're currently waiting for
+			 * something.
+			 */
 			XactLockTableWait(xid, NULL, NULL, XLTW_None);
 		}
 
+		if (!purge_running)
+		{
+			ereport(LOG,
+				(errmsg("logical decoding found initial starting point at %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+		else
+		{
+			ereport(LOG,
+				(errmsg("logical decoding moved initial starting point to %X/%X",
+						(uint32) (lsn >> 32), (uint32) lsn),
+				 errdetail_plural("%u transaction needs to finish.",
+								  "%u transactions need to finish.",
+								  builder->running.xcnt,
+								  (uint32) builder->running.xcnt)));
+		}
+
 		/* nothing could have built up so far, so don't perform cleanup */
 		return false;
 	}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 6259070..9b41a28 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -945,15 +945,10 @@ LogStandbySnapshot(void)
 	 * record. Fortunately this routine isn't executed frequently, and it's
 	 * only a shared lock.
 	 */
-	if (wal_level < WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
+	LWLockRelease(ProcArrayLock);
 
 	recptr = LogCurrentRunningXacts(running);
 
-	/* Release lock if we kept it longer ... */
-	if (wal_level >= WAL_LEVEL_LOGICAL)
-		LWLockRelease(ProcArrayLock);
-
 	/* GetRunningTransactionData() acquired XidGenLock, we must release it */
 	LWLockRelease(XidGenLock);
 
-- 
2.7.4

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] snapbuild woes

Reply via email to