Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

Tomas Vondra Sun, 13 Jul 2014 13:44:20 -0700

On 3.7.2014 19:36, Tomas Vondra wrote:
> On 1.7.2014 01:24, Tomas Vondra wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>
>> Hopefully I got it right this time. At least it seems to be working
>> for cases that failed before (no file leaks, proper rowcounts so
>> far).
> 
> Attached v7, fixing nbatch/ntuples in an assert.
> 
> regards
> Tomas


Attached is v8 of this patch, significantly reworked based on the
related discussion (mostly in 'tweaking NTUP_PER_BUCKET').

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)
----------------------------------------
- size nbuckets for NTUP_PER_BUCKET=1
- account for memory allocated for buckets

building the hash table
-----------------------
- while adding tuples, keep track of optimal number of buckets (for the
  current number of tuples per batch)
- don't resize until all the tuples are added (by increasing nbatch the
  number of buckets may decrease)
- after adding all tuples, consider resizing (optimal nbuckets !=
  initial nbuckets)
- do the resize

This means that for well estimated plans (reasonably exact tuple count
for the inner relation), the hash table is properly sized and no resize
is needed.

For badly estimated plans, this works about the same as the previous
patches (we're accounting for memory needed for buckets etc.).


More batches or less buckets?
-----------------------------
In the related threads, I repeatedly mentioned that I don't know a good
answer to "Should I add more batches or decrease the number of buckets?"
while sizing the hash table. The current code (as in HEAD) does not face
this dillema because (a) it does not count buckets into work_mem and (b)
does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch
can stand even without it. If you want more/less batches, use
smaller/larger work_mem - same as with the current code. Also, with the
'dense allocation' patch [1], it's possible to use larger work_mem
values (and thus compensate for counting buckets into work_mem).


Changes since v7
----------------

  (a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
  (b) set NTUP_PER_BUCKET=1
  (c) include buckets (optimal) when considering nbatch increase
  (d) track optimal number of buckets (for NTUP_PER_BUCKET)
  (e) after loading all the tuples, resize the hash table to get
      nbuckets_optimal (if needed)
  (f) renamed GUC to enable_hash_resize (makes a bit more sense)


Comments
--------

  (a) NTUP_GROW_THRESHOLD was overcomplicated and mostly a result of
      misunderstanding how NTUP_PER_BUCKET works (it's upper threshold,
      not average) and is not really needed.

  (b) The value "1" gives the best performance, but also requires more
      memory for the buckets. This should be more than compensated by
      densely allocating the tuples (see hashjoin-alloc-v3.patch
      in the "tweaking NTUP_PER_BUCKET" thread [1]).

  (c,d) Originally, the memory needed for buckets was not included in
        'spaceUsed', but because the NTUP_PER_BUCKET=1 change this is
        not really acceptable (needs much more memory than before).
        So now it's included both in the initial sizing of the hash
        table, and when adding more tuples (nbuckets_optimal).

        Still, spaceUsed does not include palloc overhead (which may be
        a significant amount of memory). This is tackled by [1].

  (e) No change here.

  (f) A bit better GUC name. Anyway, this is for easier development
      only, and won't be included in the final patch.


regards
Tomas

[1] http://www.postgresql.org/message-id/53c2decc.2010...@fuzzy.cz

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..1befed5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hash_resize = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,9 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -116,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +132,31 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		elog(WARNING, "nbuckets=%d optimal=%d tuples=%.0f", hashtable->nbuckets, hashtable->nbuckets_optimal,
+			hashtable->totalTuples - hashtable->skewTuples);
+
+		/* Do we actually need to resize? */
+		if (hashtable->nbuckets != hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
+	elog(WARNING, "buckets space = %ld", BUCKETS_SPACE(hashtable));
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +270,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +295,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +307,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +319,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -375,7 +415,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -384,7 +423,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -394,6 +433,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -415,6 +455,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		MAXALIGN(tupwidth);
 	inner_rel_bytes = ntuples * tupsize;
 
+	/* Estimate memory needed for buckets, with the target NTUP_PER_BUCKET */
+	buckets_bytes = sizeof(void*) * my_log2(ntuples / NTUP_PER_BUCKET);
+
 	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
@@ -466,23 +509,48 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* now, we need to see if the buckets fit into work_mem too */
+		lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+		/* assuming equally-sized batches, compute bytes of hash table and buckets */
+		batch_bytes = inner_rel_bytes / nbatch;
+		buckets_bytes = lbuckets * sizeof(void*);
+
+		/* we need to get both into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes) {
+
+			/* increment number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* now, we need to see if the buckets fit into work_mem too */
+			lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+			buckets_bytes = lbuckets * sizeof(void*);
+
+		}
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -560,6 +628,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	double		batchTuples;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -605,6 +674,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -679,8 +749,109 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		printf("Disabling further increase of nbatch\n");
 #endif
 	}
+
+	/*
+	 * After increasing the number of batches, the number of tuples per batch
+	 * is lower (~1/2), so the optimal number of buckets might be lower too.
+	 * It's probably (nbuckets_optimal/2) but let's compute it anyway.
+	 */
+	hashtable->nbuckets_optimal = 10;
+	batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+	while ((batchTuples / hashtable->nbuckets_optimal > NTUP_PER_BUCKET)
+		&& (hashtable->nbuckets_optimal <= (INT_MAX/2))) {
+		hashtable->nbuckets_optimal *= 2;
+	}
+}
+
+/*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+	double		batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *)
+		palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the buckets into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket (in the new bucket array) */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+	
 }
 
+
 /*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
@@ -734,12 +905,26 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the number of buckets (optimal) if we exceeded NTUP_PER_BUCKET. */
+		if (((hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch)
+			>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)) {
+
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX/2))
+				hashtable->nbuckets_optimal *= 2;
+
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -856,14 +1041,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno =  highers log2(nbatch) bits
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -880,7 +1063,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..f1cce9f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hash_resize", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hash_resize,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..acabb48 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,8 +106,11 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_optimal;	/* optimal number of buckets (for batch) */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -122,6 +125,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbuckets must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -129,6 +133,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..e7c5f50 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hash_resize;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

Reply via email to