On 3.7.2014 19:36, Tomas Vondra wrote:
> On 1.7.2014 01:24, Tomas Vondra wrote:
>> On 30.6.2014 23:12, Tomas Vondra wrote:
>>> Hi,
>>
>> Hopefully I got it right this time. At least it seems to be working
>> for cases that failed before (no file leaks, proper rowcounts so
>> far).
>
> Attached v7, fixing nbatch/ntuples in an assert.
>
> regards
> Tomas
Attached is v8 of this patch, significantly reworked based on the related
discussion (mostly in the 'tweaking NTUP_PER_BUCKET' thread).

With the patch applied, the hash table works like this:

initial sizing (ExecChooseHashTableSize)
----------------------------------------

 - size nbuckets for NTUP_PER_BUCKET=1
 - account for memory allocated for buckets

building the hash table
-----------------------

 - while adding tuples, keep track of the optimal number of buckets
   (for the current number of tuples per batch)
 - don't resize until all the tuples are added (because increasing
   nbatch may decrease the number of buckets)
 - after adding all tuples, consider resizing (if optimal nbuckets !=
   initial nbuckets)
 - do the resize

This means that for well-estimated plans (reasonably exact tuple count
for the inner relation), the hash table is properly sized and no resize
is needed. For badly estimated plans, this works about the same as the
previous patches (we're accounting for memory needed for buckets, etc.).

More batches or fewer buckets?
------------------------------

In the related threads, I repeatedly mentioned that I don't know a good
answer to "Should I add more batches or decrease the number of buckets?"
when sizing the hash table. The current code (as in HEAD) does not face
this dilemma because it (a) does not count buckets into work_mem and
(b) does not allow changing nbuckets.

I still don't have a good answer to this question, but I think the patch
can stand even without it. If you want more/fewer batches, use
smaller/larger work_mem - same as with the current code. Also, with the
'dense allocation' patch [1], it's possible to use larger work_mem
values (and thus compensate for counting buckets into work_mem).
Changes since v7
----------------

 (a) remove NTUP_GROW_THRESHOLD (just use NTUP_PER_BUCKET directly)
 (b) set NTUP_PER_BUCKET=1
 (c) include buckets (optimal) when considering nbatch increase
 (d) track optimal number of buckets (for NTUP_PER_BUCKET)
 (e) after loading all the tuples, resize the hash table to get
     nbuckets_optimal (if needed)
 (f) renamed GUC to enable_hash_resize (makes a bit more sense)

Comments
--------

 (a) NTUP_GROW_THRESHOLD was overcomplicated, mostly a result of
     misunderstanding how NTUP_PER_BUCKET works (it's an upper
     threshold, not an average), and is not really needed.

 (b) The value "1" gives the best performance, but also requires more
     memory for the buckets. This should be more than compensated for by
     densely allocating the tuples (see hashjoin-alloc-v3.patch in the
     'tweaking NTUP_PER_BUCKET' thread [1]).

 (c,d) Originally, the memory needed for buckets was not included in
     'spaceUsed', but with the NTUP_PER_BUCKET=1 change this is no
     longer acceptable (it needs much more memory than before). So now
     it's included both in the initial sizing of the hash table and when
     adding more tuples (nbuckets_optimal). Still, spaceUsed does not
     include palloc overhead (which may be a significant amount of
     memory). This is tackled by [1].

 (e) No change here.

 (f) A bit better GUC name. Anyway, this is for easier development only,
     and won't be included in the final patch.

regards
Tomas

[1] http://www.postgresql.org/message-id/53c2decc.2010...@fuzzy.cz
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original,
+							 spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..1befed5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool		enable_hash_resize = true;
+
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -48,6 +50,9 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+/* Memory needed for the optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -116,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,6 +132,31 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If the average number of tuples per bucket is over the defined
+	 * threshold, increase the number of buckets to get below it. */
+	if (enable_hash_resize)
+	{
+		elog(WARNING, "nbuckets=%d optimal=%d tuples=%.0f",
+			 hashtable->nbuckets, hashtable->nbuckets_optimal,
+			 hashtable->totalTuples - hashtable->skewTuples);
+
+		/* Do we actually need to resize? */
+		if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+		{
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+				   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+	}
+
+	elog(WARNING, "buckets space = %ld", BUCKETS_SPACE(hashtable));
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+		hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -239,6 +270,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +295,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nbatch must be a power of 2 */
+	log2_nbatch = my_log2(nbatch);
+	Assert(nbatch == (1 << log2_nbatch));
+
 	/*
 	 * Initialize the hash table control block.
 	 *
@@ -271,6 +307,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -281,10 +319,12 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->skewBucketNums = NULL;
 	hashtable->nbatch = nbatch;
 	hashtable->curbatch = 0;
+	hashtable->log2_nbatch = log2_nbatch;
 	hashtable->nbatch_original = nbatch;
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -375,7 +415,6 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	return hashtable;
 }
 
-
 /*
  * Compute appropriate size for hashtable given the estimated size of the
  * relation to be hashed (number of rows and average row width).
@@ -384,7 +423,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -394,6 +433,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	double		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -415,6 +455,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		MAXALIGN(tupwidth);
 	inner_rel_bytes = ntuples * tupsize;
 
	/* Estimate memory needed for buckets, with the target NTUP_PER_BUCKET */
+	buckets_bytes = sizeof(void*) * (1L << my_log2(ntuples / NTUP_PER_BUCKET));
+
 	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
@@ -466,23 +509,48 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		double		batch_bytes;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
 		minbatch = (int) dbatch;
 		nbatch = 2;
+
+		/* increase nbatch until the hashtable fits into work_mem */
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* now, we need to see if the buckets fit into work_mem too */
+		lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+		/* assuming equally-sized batches, compute bytes of hash table and buckets */
+		batch_bytes = inner_rel_bytes / nbatch;
+		buckets_bytes = lbuckets * sizeof(void*);
+
+		/* we need to get both into work_mem */
+		while (batch_bytes + buckets_bytes > hash_table_bytes)
+		{
+			/* increment the number of batches, and re-evaluate the amount of memory */
+			nbatch <<= 1;
+
+			/* now, we need to see if the buckets fit into work_mem too */
+			lbuckets = 1 << (my_log2(ntuples / nbatch / NTUP_PER_BUCKET));
+
+			/* assuming equally-sized batches, compute bytes of hash table and buckets */
+			batch_bytes = inner_rel_bytes / nbatch;
+			buckets_bytes = lbuckets * sizeof(void*);
+		}
+
+		/* protect against nbuckets overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
 	}
 	else
 	{
@@ -560,6 +628,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	double		batchTuples;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -605,6 +674,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	MemoryContextSwitchTo(oldcxt);
 
 	hashtable->nbatch = nbatch;
+	hashtable->log2_nbatch += 1;
 
 	/*
 	 * Scan through the existing hash table entries and dump out any that are
@@ -679,8 +749,109 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		printf("Disabling further increase of nbatch\n");
 #endif
 	}
+
+	/*
+	 * After increasing the number of batches, the number of tuples per batch
+	 * is lower (~1/2), so the optimal number of buckets might be lower too.
+	 * It's probably (nbuckets_optimal/2) but let's compute it anyway.
+	 */
+	hashtable->nbuckets_optimal = 1;
+	batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+	while ((batchTuples / hashtable->nbuckets_optimal > NTUP_PER_BUCKET)
+		   && (hashtable->nbuckets_optimal <= (INT_MAX / 2)))
+	{
+		hashtable->nbuckets_optimal *= 2;
+	}
+}
+
+/*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		the number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple *oldbuckets = hashtable->buckets;
+	MemoryContext oldcxt;
+	double		batchTuples = (hashtable->totalTuples - hashtable->skewTuples)
+							  / hashtable->nbatch;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such a position (it should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into the hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX / 2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be
+	 * possible, but when I tried it I always ended up with a strange
+	 * infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *)
+		palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, move the tuples into the new table */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket (in the new bucket array) */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+}
 
 /*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
@@ -734,12 +905,26 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* Increase the (optimal) number of buckets if we exceeded
+		 * NTUP_PER_BUCKET. */
+		if (((hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch)
+			>= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET))
+		{
+			/* overflow protection */
+			if (hashtable->nbuckets_optimal <= (INT_MAX / 2))
+				hashtable->nbuckets_optimal *= 2;
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/* Do we need to increase the number of batches? Account for the
+		 * space required by buckets (optimal number). */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
 	}
 	else
 	{
@@ -856,14 +1041,12 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno = high log2(nbatch) bits of hashvalue
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
- *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
  * effectively adds one more bit to the top of the batchno.
  */
@@ -880,7 +1063,7 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 	{
 		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = (hashvalue >> (32 - hashtable->log2_nbatch));
 	}
 	else
 	{
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..f1cce9f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hash_resize", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hash_resize,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..acabb48 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,8 +106,11 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_optimal;	/* optimal number of buckets (for batch) */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -122,6 +125,7 @@ typedef struct HashJoinTableData
 
 	int			nbatch;			/* number of batches */
 	int			curbatch;		/* current batch #; 0 during 1st pass */
+	int			log2_nbatch;	/* its log2 (nbatch must be a power of 2) */
 
 	int			nbatch_original;	/* nbatch when we started inner scan */
 	int			nbatch_outstart;	/* nbatch when we started outer scan */
@@ -129,6 +133,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew buckets */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..e7c5f50 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hash_resize;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers