On Fri, Oct 2, 2015 at 4:27 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> ... do you want to produce an updated patch?  I'm not planning to look at
> it until tomorrow anyway.

The attached, revised patch deals with the issues around removing the
query text file when garbage collection encounters trouble. There is
no reason to expect that an error within gc_qtexts() won't simply
recur during a future garbage collection. OOM might be an exception,
but probably isn't, since gc_qtexts() is reached when a new entry is
created with a new query text, which in general makes OOM
progressively more likely.

Thanks for accommodating me here.

-- 
Peter Geoghegan
From 9295ecac575d6bed76e3b66a3b4cc1577517f2fc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Tue, 29 Sep 2015 15:32:51 -0700
Subject: [PATCH] Fix pg_stat_statements garbage collection bugs

Previously, pg_stat_statements garbage collection could encounter
problems when many large, technically distinct queries were executed but
never reached the end of execution due to some error.  Garbage
collection could enter a state in which it would never complete
successfully, with the external query text file left to grow
indefinitely.  For example, this could happen when a data integration
process failed repeatedly before finally succeeding (perhaps recurring
intermittently over a fairly long period).  The problematic queries
might have referenced technically distinct tables each time, as when a
CREATE TABLE is performed as part of each data integration attempt,
giving every execution a unique queryid.

A spuriously high mean query length was previously possible, because
failed queries' sticky entries artificially inflated the mean during
hash table eviction.  The mechanism that throttles garbage collection to
prevent thrashing could therefore put it off for far too long.  By the
time a garbage collection actually occurred, the file might have grown
so large that memory could not be allocated for all query texts.  Worse
still, once the quasi-arbitrary MaxAllocSize threshold was crossed, the
system could not recover even when enough memory was available to hold a
buffer for all query texts.
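
To make the arithmetic concrete, here is a tiny standalone illustration
(not part of the patch) of the throttle check from need_gc_qtexts(),
extent < pgss->mean_query_len * pgss_max * 2.  The 5000 entry count is
the pg_stat_statements.max default, MaxAllocSize is just under 1GB, and
the 128kB inflated mean is only an assumed example value:

#include <stdio.h>

int
main(void)
{
	const size_t max_alloc_size = 0x3fffffff;	/* MaxAllocSize, just under 1GB */
	size_t		pgss_max = 5000;			/* pg_stat_statements.max default */
	size_t		mean_query_len = 128 * 1024;	/* assumed mean, inflated by sticky entries */

	/* need_gc_qtexts() defers GC until the file exceeds this extent */
	size_t		gc_threshold = mean_query_len * pgss_max * 2;

	printf("GC deferred until the texts file exceeds %zu bytes,\n", gc_threshold);
	printf("but loading the file already fails above %zu bytes.\n", max_alloc_size);
	return 0;
}

With these example numbers, GC is put off until the file passes ~1.3GB,
which is already beyond what qtext_load_file() could allocate under the
old MaxAllocSize cap.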

To fix these issues, pg_stat_statements no longer considers sticky
entries when calculating the mean query text length, which should
prevent garbage collection from operating with an inflated idea of the
typical query text's size.  MaxAllocSize no longer caps the size of the
buffer used to hold query texts in memory during garbage collection; the
cap is raised to MaxAllocHugeSize.  Garbage collection no longer
tolerates OOM (or exceeding the new MaxAllocHugeSize cap) when
allocating a buffer to hold all existing query texts.  Garbage
collection now unlink()s the query text file at the first sign of
trouble, giving it a clean slate, rather than assuming that the issue
will somehow resolve itself later, which seemed overly optimistic in
general.
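
In case it helps review, here is a toy, self-contained model (not the
patch itself; the ToyEntry struct and recompute_mean() are invented for
illustration) of the revised entry_dealloc() accounting: sticky entries,
and entries whose texts were dropped, contribute the current mean as a
placeholder rather than their actual text length, so a run of huge
failed queries no longer drags the mean upward:

#include <stdio.h>
#include <stddef.h>

typedef struct
{
	long		calls;			/* 0 => sticky entry (never finished executing) */
	int			query_len;		/* -1 => text dropped/unavailable */
} ToyEntry;

/* Recompute the mean query length, modeled loosely on patched entry_dealloc() */
static size_t
recompute_mean(const ToyEntry *entries, int n, size_t prev_mean)
{
	size_t		totlen = 0;
	int			i;

	for (i = 0; i < n; i++)
	{
		if (entries[i].calls == 0 || entries[i].query_len < 0)
			totlen += prev_mean;	/* placeholder, not the (possibly huge) text */
		else
			totlen += entries[i].query_len + 1;
	}
	return (n > 0) ? totlen / n : prev_mean;
}

int
main(void)
{
	/* one ordinary 1kB query plus three sticky 512kB failures */
	ToyEntry	entries[] = {
		{10, 1024}, {0, 512 * 1024}, {0, 512 * 1024}, {0, 512 * 1024}
	};

	printf("mean query length stays at %zu bytes\n",
		   recompute_mean(entries, 4, 1024));
	return 0;
}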

As a precaution, pg_stat_statements may now avoid storing very large
query texts after parse analysis but ahead of query execution (entries
for which no statistics exist yet).

Backpatch to 9.4, where external query text storage was introduced.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 84 ++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 9 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..3e26213 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -103,6 +103,7 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 #define USAGE_INIT				(1.0)	/* including initial planning */
 #define ASSUMED_MEDIAN_INIT		(10.0)	/* initial assumed median usage */
 #define ASSUMED_LENGTH_INIT		1024	/* initial assumed mean query length */
+#define LENGTH_OUTLIER_FACTOR	5		/* mean query length multiplier */
 #define USAGE_DECREASE_FACTOR	(0.99)	/* decreased every entry_dealloc */
 #define STICKY_DECREASE_FACTOR	(0.50)	/* factor for sticky entries */
 #define USAGE_DEALLOC_PERCENT	5		/* free this % of entries at once */
@@ -1111,6 +1112,10 @@ pgss_hash_string(const char *str)
  * If jstate is not NULL then we're trying to create an entry for which
  * we have no statistics as yet; we just want to record the normalized
  * query string.  total_time, rows, bufusage are ignored in this case.
+ * A query text may not actually end up being stored in this case
+ * (storing oversized texts is sometimes avoided), and in any event is
+ * not guaranteed to still be around if and when actual results need to
+ * be stored in a subsequent pgss_store() call.
  */
 static void
 pgss_store(const char *query, uint32 queryId,
@@ -1159,7 +1164,27 @@ pgss_store(const char *query, uint32 queryId,
 		 */
 		if (jstate)
 		{
+			Size		mean_query_len = Max(ASSUMED_LENGTH_INIT,
+											 pgss->mean_query_len);
+
 			LWLockRelease(pgss->lock);
+
+			/*
+			 * When creating an entry for which we have no statistics yet,
+			 * avoid storing very large query texts.  This prevents bloating of
+			 * the query text file in certain scenarios.
+			 *
+			 * For example, excessive churn might otherwise happen when a user
+			 * repeatedly runs a data integration transaction that fails after
+			 * we create sticky entries that are bound to never "unstick".
+			 * This might be due to transactions creating a new table for each
+			 * failed attempt; a new relation OID is involved each time, making
+			 * each failed attempt create distinct entries with large query
+			 * texts.
+			 */
+			if (query_len > mean_query_len * LENGTH_OUTLIER_FACTOR)
+				return;
+
 			norm_query = generate_normalized_query(jstate, query,
 												   &query_len,
 												   encoding);
@@ -1733,11 +1758,20 @@ entry_dealloc(void)
 		entries[i++] = entry;
 		/* "Sticky" entries get a different usage decay rate. */
 		if (entry->counters.calls == 0)
+		{
 			entry->counters.usage *= STICKY_DECREASE_FACTOR;
+			/* Accumulate generic placeholder length into total size. */
+			totlen += pgss->mean_query_len;
+		}
 		else
+		{
 			entry->counters.usage *= USAGE_DECREASE_FACTOR;
-		/* Accumulate total size, too. */
-		totlen += entry->query_len + 1;
+			/* Accumulate total size (use placeholder for dropped texts). */
+			if (entry->query_len >= 0)
+				totlen += entry->query_len + 1;
+			else
+				totlen += pgss->mean_query_len;
+		}
 	}
 
 	qsort(entries, i, sizeof(pgssEntry *), entry_cmp);
@@ -1892,7 +1926,7 @@ qtext_load_file(Size *buffer_size)
 	}
 
 	/* Allocate buffer; beware that off_t might be wider than size_t */
-	if (stat.st_size <= MaxAllocSize)
+	if (stat.st_size <= MaxAllocHugeSize)
 		buf = (char *) malloc(stat.st_size);
 	else
 		buf = NULL;
@@ -1984,6 +2018,10 @@ need_gc_qtexts(void)
 	 * for file's large size.  We go to the trouble of maintaining the mean
 	 * query length in order to prevent garbage collection from thrashing
 	 * uselessly.
+	 *
+	 * Note that sticky entries do not contribute their actual text length
+	 * to the mean, so a run of large texts from slightly distinct queries
+	 * that all fail to finish cannot put off garbage collection forever.
 	 */
 	if (extent < pgss->mean_query_len * pgss_max * 2)
 		return false;
@@ -2001,14 +2039,17 @@ need_gc_qtexts(void)
  * becomes unreasonably large, with no other method of compaction likely to
  * occur in the foreseeable future.
  *
- * The caller must hold an exclusive lock on pgss->lock.
+ * The caller must hold an exclusive lock on pgss->lock.  At the first sign of
+ * trouble we unlink() the query text file to get a clean slate (although the
+ * existing statistics themselves are retained), rather than risk thrashing by
+ * allowing the same problem case to recur indefinitely.
  */
 static void
 gc_qtexts(void)
 {
 	char	   *qbuffer;
 	Size		qbuffer_size;
-	FILE	   *qfile;
+	FILE	   *qfile = NULL;
 	HASH_SEQ_STATUS hash_seq;
 	pgssEntry  *entry;
 	Size		extent;
@@ -2023,12 +2064,12 @@ gc_qtexts(void)
 		return;
 
 	/*
-	 * Load the old texts file.  If we fail (out of memory, for instance) just
-	 * skip the garbage collection.
+	 * Load the old texts file.  If we fail (out of memory, for instance),
+	 * invalidate query texts.  This should be rare.
 	 */
 	qbuffer = qtext_load_file(&qbuffer_size);
 	if (qbuffer == NULL)
-		return;
+		goto gc_fail;
 
 	/*
 	 * We overwrite the query texts file in place, so as to reduce the risk of
@@ -2147,7 +2188,32 @@ gc_fail:
 		entry->query_len = -1;
 	}
 
-	/* Seems like a good idea to bump the GC count even though we failed */
+	/* Unlink query text file to get a clean slate and prevent thrashing */
+	unlink(PGSS_TEXT_FILE);
+	pgss->mean_query_len = ASSUMED_LENGTH_INIT;
+	pgss->extent = 0;
+	/* Allocate new query text temp file */
+	qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
+	if (qfile == NULL)
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not write new pg_stat_statement file \"%s\": %m",
+						PGSS_TEXT_FILE)));
+	else
+		FreeFile(qfile);
+
+	/*
+	 * Bump the GC count even though we failed.
+	 *
+	 * This is needed so that concurrent readers, who check the GC count
+	 * without holding pgss->lock, will notice that a new version of the
+	 * file exists.  Once readers subsequently observe a change in the GC
+	 * count with pgss->lock held, that forces a safe reopen of the file.
+	 * Writers also require that we bump here, of course.  (As required by
+	 * the locking protocol, readers and writers don't trust earlier file
+	 * contents until gc_count is found unchanged after pgss->lock has been
+	 * acquired in shared or exclusive mode respectively.)
+	 */
 	record_gc_qtexts();
 }
 
-- 
1.9.1
