On Tue, Sep 22, 2015 at 6:01 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Tue, Sep 22, 2015 at 5:01 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> My guess is that this very large query involved a very large number of
>> constants, possibly contained inside an "IN ( )" list. Slight variants of
>> the same query, that a human would probably consider to be equivalent,
>> have caused artificial pressure on garbage collection.
> I could write a patch to do compaction in-place.

In the end, I decided on a simpler approach to fixing this general
sort of problem with the attached patch. See the commit message for
details.
I went this way not because compaction in-place was necessarily a bad
idea, but because I think that a minimal approach will work just as
well in real world cases.

It would be nice to get this committed before the next point releases
are tagged, since I've now heard a handful of complaints like this.

Peter Geoghegan
From 5afb0be51bfe0a7013452a6d591abb75f796a898 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Tue, 29 Sep 2015 15:32:51 -0700
Subject: [PATCH] Fix pg_stat_statements garbage collection bugs

Previously, pg_stat_statements garbage collection could encounter
problems when many large, technically distinct queries were executed
without completing successfully.  Garbage collection could enter a state
where it would never complete successfully, with the external query text
file left to grow indefinitely.  For example, this could happen when a
data integration process fails repeatedly before finally succeeding
(perhaps this process could recur intermittently over a fairly long
period).  Problematic queries might have referenced what are technically
distinct tables each time, as when a CREATE TABLE is performed as part
of the data integration operation, giving each execution a unique
queryId.
This example scenario should not pose any problem;  garbage collection
should still occur fairly frequently.  However, a spuriously high mean
query length was previously possible due to failed queries' sticky
entries artificially inflating mean query length during hash table
eviction.  The mechanism by which garbage collection is throttled to
prevent thrashing could therefore put it off for far too long.  By the
time a garbage collection actually occurs, the file may have become so
big that it may not be possible for memory to be allocated for all query
texts.  Worse still, once the MaxAllocSize threshold was crossed, it
became impossible for the system to recover even with sufficient memory
to allocate a buffer for all query texts.

pg_stat_statements now no longer weighs sticky entries when calculating
mean query length, which should prevent garbage collection from getting
an inaccurate idea of the user-visible mean query length.  As a precaution,
pg_stat_statements may now avoid storing very large query texts after
parse analysis but ahead of query execution (entries for which no
statistics exist yet) based on certain criteria.  Garbage collection no
longer tolerates OOMs when allocating a buffer to store all existing
query texts.  MaxAllocSize no longer caps the buffer returned by
malloc() that is used to store query texts in memory during garbage

Backpatch to 9.4, where external query text storage was introduced.
 contrib/pg_stat_statements/pg_stat_statements.c | 53 +++++++++++++++++++++----
 1 file changed, 46 insertions(+), 7 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..97584b6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -103,6 +103,7 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 #define USAGE_INIT				(1.0)	/* including initial planning */
 #define ASSUMED_MEDIAN_INIT		(10.0)	/* initial assumed median usage */
 #define ASSUMED_LENGTH_INIT		1024	/* initial assumed mean query length */
+#define LENGTH_OUTLIER_FACTOR	5		/* mean query length multiplier */
 #define USAGE_DECREASE_FACTOR	(0.99)	/* decreased every entry_dealloc */
 #define STICKY_DECREASE_FACTOR	(0.50)	/* factor for sticky entries */
 #define USAGE_DEALLOC_PERCENT	5		/* free this % of entries at once */
@@ -1111,6 +1112,10 @@ pgss_hash_string(const char *str)
  * If jstate is not NULL then we're trying to create an entry for which
  * we have no statistics as yet; we just want to record the normalized
  * query string.  total_time, rows, bufusage are ignored in this case.
+ * A query text may not actually end up being stored in this case
+ * (storing oversized texts is sometimes avoided), and in any event is
+ * not guaranteed to still be around if and when actual results need to
+ * be stored in a subsequent pgss_store() call.
 static void
 pgss_store(const char *query, uint32 queryId,
@@ -1159,7 +1164,28 @@ pgss_store(const char *query, uint32 queryId,
 		if (jstate)
 		{
+			Size		mean_query_len = Max(ASSUMED_LENGTH_INIT,
+											 pgss->mean_query_len);
+			/*
+			 * When creating an entry for which we have no statistics yet,
+			 * avoid storing very large query texts.  This prevents bloating of
+			 * the query text file in certain scenarios.
+			 *
+			 * For example, excessive churn might otherwise happen when a user
+			 * repeatedly runs a data integration transaction that fails after
+			 * we create sticky entries that are bound to remain sticky.  This
+			 * might be due to transactions creating a new table for each
+			 * failed attempt; a new relation OID is involved each time, making
+			 * each failed attempt create distinct sticky entries with large
+			 * query texts (fingerprinting to create a queryid usually involves
+			 * jumbling relation OIDs).
+			 */
+			if (query_len > mean_query_len * LENGTH_OUTLIER_FACTOR)
+				return;
 			norm_query = generate_normalized_query(jstate, query,
@@ -1733,11 +1759,17 @@ entry_dealloc(void)
 		entries[i++] = entry;
 		/* "Sticky" entries get a different usage decay rate. */
 		if (entry->counters.calls == 0)
+		{
 			entry->counters.usage *= STICKY_DECREASE_FACTOR;
+			/* Accumulate generic placeholder length into total size. */
+			totlen += pgss->mean_query_len;
+		}
 		else
+		{
 			entry->counters.usage *= USAGE_DECREASE_FACTOR;
-		/* Accumulate total size, too. */
-		totlen += entry->query_len + 1;
+			/* Accumulate total size. */
+			totlen += entry->query_len + 1;
+		}
 	qsort(entries, i, sizeof(pgssEntry *), entry_cmp);
@@ -1892,7 +1924,7 @@ qtext_load_file(Size *buffer_size)
 	/* Allocate buffer; beware that off_t might be wider than size_t */
-	if (stat.st_size <= MaxAllocSize)
+	if (stat.st_size <= SIZE_MAX)
 		buf = (char *) malloc(stat.st_size);
 	else
 		buf = NULL;
@@ -1984,6 +2016,11 @@ need_gc_qtexts(void)
 	 * for file's large size.  We go to the trouble of maintaining the mean
 	 * query length in order to prevent garbage collection from thrashing
 	 * uselessly.
+	 *
+	 * Note that sticky entries do not contribute to mean query length, and so
+	 * large query texts from a succession of slightly distinct queries
+	 * that repeatedly fail to finish executing cause no problems.  We
+	 * will not back off from garbage collection when that happens.
 	 */
 	if (extent < pgss->mean_query_len * pgss_max * 2)
 		return false;
@@ -2008,7 +2045,7 @@ gc_qtexts(void)
 	char	   *qbuffer;
 	Size		qbuffer_size;
-	FILE	   *qfile;
+	FILE	   *qfile = NULL;
 	HASH_SEQ_STATUS hash_seq;
 	pgssEntry  *entry;
 	Size		extent;
@@ -2023,12 +2060,14 @@ gc_qtexts(void)
-	 * Load the old texts file.  If we fail (out of memory, for instance) just
-	 * skip the garbage collection.
+	 * Load the old texts file.  If we fail (out of memory, for instance),
+	 * invalidate query texts.  This should be rare.  It seems like a good idea
+	 * to avoid risking an ongoing and non-obvious impact to general response
+	 * time, even at the cost of losing useful instrumentation.
 	 */
 	qbuffer = qtext_load_file(&qbuffer_size);
 	if (qbuffer == NULL)
-		return;
+		goto gc_fail;
 	 * We overwrite the query texts file in place, so as to reduce the risk of

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)