On 9.9.2014 16:09, Robert Haas wrote:
> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>> So I only posted the separate patch for those who want to do a review,
>> and then a "complete patch" with both parts combined. But it sure may be
>> a bit confusing.
>
> Let's do this: post each new version of the patches only on this
> thread, as a series of patches that can be applied one after another.
OK, attached. Apply in this order:

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch

>>> In ExecChooseHashTableSize(), if buckets_bytes is independent of
>>> nbatch, could we just compute it before working out dbatch, and just
>>> deduct it from hash_table_bytes? That seems like it'd do the same
>>> thing for less work.
>>
>> Well, we can do that. But I really don't think that's going to make
>> a measurable difference. And I think the code is clearer this way.
>
> Really? It seems to me that you're doing more or less the same
> calculation that's already been done a second time. It took me 20
> minutes of staring to figure out what it was really doing. Now
> admittedly I was a little low on caffeine, but...

I reworked this part a bit to make it easier to understand, mostly
following your recommendations.

>>> Either way, what happens if buckets_bytes is already bigger than
>>> hash_table_bytes?
>>
>> Hm, I don't see how that could happen.
>
> How about an Assert() and a comment, then?

Done. According to the comment, we should never really get over 25%,
except maybe for strange work_mem values, because we keep nbuckets as
2^N. But even then we should not reach 50%, so I added this assert:

    Assert(buckets_bytes <= hash_table_bytes/2);

We then subtract buckets_bytes and continue with the loop.

>>> A few more stylistic issues that I see:
>>>
>>> +       if ((hashtable->nbatch == 1) &&
>>> +       if (hashtable->nbuckets_optimal <= (INT_MAX/2))
>>> +       if (size > (HASH_CHUNK_SIZE/8))
>>>
>>> While I'm all in favor of using parentheses generously where it's
>>> needed to add clarity, these and similar usages seem over the top to
>>> me.
>>
>> It seemed more readable to me at the time of coding it, and it still
>> seems better this way, but ...
>>
>> https://www.youtube.com/watch?v=CxK_nA2iVXw
>
> Heh. Well, at any rate, I think PostgreSQL style is to not use
> parentheses in that situation.

FWIW, I added a HASH_CHUNK_THRESHOLD macro specifying the threshold.
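To double-check the reasoning behind that assert, here is a small standalone sketch (not the actual PostgreSQL code: `my_log2` is reimplemented here with the same contract, and `buckets_bytes_for` is a hypothetical helper) of the nbuckets sizing rule: round the tuple count up to a power of two, clamp to a minimum of 1024 buckets, and charge one pointer per bucket:

```c
#include <stddef.h>

/* ceil(log2(num)) - same contract as my_log2 in the backend */
static int
my_log2(long num)
{
	int		i;
	long	limit;

	for (i = 0, limit = 1; limit < num; i++, limit <<= 1)
		;
	return i;
}

/*
 * Hypothetical helper: bytes needed for the bucket array, assuming
 * NTUP_PER_BUCKET = 1 and a minimum of 1024 buckets, as in the patch.
 */
static long
buckets_bytes_for(double ntuples)
{
	long	nbuckets = 1L << my_log2((long) ntuples);

	if (nbuckets < 1024)
		nbuckets = 1024;

	/* buckets are just HashJoinTuple pointers */
	return nbuckets * (long) sizeof(void *);
}
```

Since nbuckets is a power of two and each bucket costs one pointer while each tuple costs at least 24B plus data, the bucket array should stay well below half of work_mem, which is what the assert verifies.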
I also simplified the conditions a bit and removed some of the
parentheses.

>>> +char * chunk_alloc(HashJoinTable hashtable, int size) {
>>>
>>> Eh... no.
>>>
>>> +       /* XXX maybe we should use MAXALIGN(size) here ... */
>>>
>>> +1.
>>
>> I'm not sure what the 'no' is pointing at? Maybe that the parenthesis
>> should be on the next line? And is the +1 about doing MAXALIGN?
>
> The "no" is about the fact that you have the return type, the function
> name, and the opening brace on one line instead of three lines as is
> project style. The +1 is for applying MAXALIGN to the size.

OK, fixed. I also made a few minor changes to the dense allocation
patch, most notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData

  The original naming seemed a bit awkward.

* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')

  Seems like a better fit for what the function does.

* the chunk size is 32kB (instead of 16kB), and we're using a 1/4
  threshold for 'oversized' items

  We need the threshold to be >= 8kB to trigger the special case
  within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.

I also considered Heikki's suggestion that while rebatching, we can
reuse chunks containing a single large tuple instead of copying the
tuple needlessly. My attempt, however, made the code much uglier (I
almost used a GOTO, but my hands refused to type it ...). I'll look
into that.

Tomas
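For reviewers who prefer not to start with the full diff, the allocation scheme boils down to a bump allocator over a linked list of chunks. The following is a simplified standalone sketch, not the patch itself: it uses plain malloc instead of MemoryContextAlloc, a crude 8-byte alignment as a stand-in for MAXALIGN, and a `Chunk` struct standing in for MemoryChunkData:

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE		(32 * 1024)			/* 32kB chunks, as in v5 */
#define CHUNK_THRESHOLD	(CHUNK_SIZE / 4)	/* 1/4 'oversized' threshold */

typedef struct Chunk
{
	struct Chunk *next;		/* linked list of chunks */
	size_t		maxlen;		/* length of the buffer */
	size_t		used;		/* bytes already handed out */
	char		data[];		/* the buffer itself */
} Chunk;

/*
 * Bump-allocate 'size' bytes from the head chunk; start a fresh chunk when
 * the current one cannot fit the request. Oversized requests get a chunk
 * of their own, linked *after* the head so the head's free space remains
 * usable for subsequent small allocations.
 */
static char *
dense_alloc_sketch(Chunk **chunks, size_t size)
{
	Chunk	   *c;

	size = (size + 7) & ~(size_t) 7;	/* crude stand-in for MAXALIGN */

	if (size > CHUNK_THRESHOLD)
	{
		c = malloc(offsetof(Chunk, data) + size);
		c->maxlen = c->used = size;
		if (*chunks)
		{
			c->next = (*chunks)->next;
			(*chunks)->next = c;
		}
		else
		{
			c->next = NULL;
			*chunks = c;
		}
		return c->data;
	}

	if (*chunks == NULL || (*chunks)->maxlen - (*chunks)->used < size)
	{
		c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);
		c->maxlen = CHUNK_SIZE;
		c->used = 0;
		c->next = *chunks;
		*chunks = c;
	}

	c = *chunks;
	c->used += size;
	return c->data + (c->used - size);
}
```

The payoff of the scheme is that rebatching can walk the chunk list directly (no per-tuple pfree, no prev-pointer bookkeeping) and release whole chunks at once.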
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d5acfb1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 					int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* ----------------------------------------------------------------
  *		ExecHash
@@ -294,6 +295,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +559,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	MemoryChunk oldchunks = hashtable->chunks;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +615,71 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
+	/*
+	 * We will scan through the chunks directly, so we can reset the buckets
+	 * and copy/insert the tuples from scratch. So set all buckets to NULL.
+	 * And throw away the old chunks - we have a pointer in oldchunks, and
+	 * we'll free them once we've copied all the tuples.
+	 */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+	hashtable->chunks = NULL;
+
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL)
 	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		MemoryChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;
 
-		while (tuple != NULL)
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used)
 		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int bucketno;
+			int batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 									  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-				/* keep tuple */
-				prevtuple = tuple;
+				/* keep tuple - this means we need to copy it into the new chunks */
+				HashJoinTuple copyTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
+				memcpy(copyTuple, hashTuple, hashTupleSize);
+
+				/* course add it to the apropriate bucket */
+				copyTuple->next = hashtable->buckets[bucketno];
+				hashtable->buckets[bucketno] = copyTuple;
 			}
 			else
 			{
 				/* dump it out */
 				Assert(batchno > curbatch);
-				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
-									  tuple->hashvalue,
+				ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
+									  hashTuple->hashvalue,
 									  &hashtable->innerBatchFile[batchno]);
-				/* and remove from hash table */
-				if (prevtuple)
-					prevtuple->next = nexttuple;
-				else
-					hashtable->buckets[i] = nexttuple;
-				/* prevtuple doesn't change */
-				hashtable->spaceUsed -=
-					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
-				pfree(tuple);
+
+				hashtable->spaceUsed -= hashTupleSize;
 				nfreed++;
 			}
-			tuple = nexttuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
 		}
+
+		/* so, we're done with this chunk - free it and proceed to the next one */
+		pfree(oldchunks);
+		oldchunks = nextchunk;
 	}
 
 #ifdef HJDEBUG
@@ -717,8 +740,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
-		hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
-													   hashTupleSize);
+		hashTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
+
 		hashTuple->hashvalue = hashvalue;
 		memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
@@ -1068,6 +1091,9 @@ ExecHashTableReset(HashJoinTable hashtable)
 	hashtable->spaceUsed = 0;
 
 	MemoryContextSwitchTo(oldcxt);
+
+	/* Forget the chunks (the memory was freed by the context reset above). */
+	hashtable->chunks = NULL;
 }
 
 /*
@@ -1318,6 +1344,69 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
 	return INVALID_SKEW_BUCKET_NO;
 }
 
+static char *
+dense_alloc(HashJoinTable hashtable, int size)
+{
+
+	/* just in case the size is not already aligned properly */
+	size = MAXALIGN(size);
+
+	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	if (size > HASH_CHUNK_THRESHOLD) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		MemoryChunk newChunk
+			= (MemoryChunk)MemoryContextAlloc(hashtable->batchCxt,
+								offsetof(MemoryChunkData, data) + size);
+
+		newChunk->maxlen = size;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		/* If there already is a chunk, add if after it so we can still use the space in it */
+		if (hashtable->chunks != NULL) {
+			newChunk->next = hashtable->chunks->next;
+			hashtable->chunks->next = newChunk;
+		} else {
+			newChunk->next = hashtable->chunks;
+			hashtable->chunks = newChunk;
+		}
+
+		newChunk->used += size;
+		newChunk->ntuples += 1;
+
+		return newChunk->data;
+
+	}
+
+	/*
+	 * If it's within the size limit, see if we have enough space for it in the
+	 * current chunk (if there's one). If not, allocate a fresh chunk.
+	 */
+	if ((hashtable->chunks == NULL) || (hashtable->chunks->maxlen - hashtable->chunks->used) < size) {
+
+		/* allocate new chunk and put it at the beginning of the list */
+		MemoryChunk newChunk
+			= (MemoryChunk)MemoryContextAlloc(hashtable->batchCxt,
								offsetof(MemoryChunkData, data) + HASH_CHUNK_SIZE);
+
+		newChunk->maxlen = HASH_CHUNK_SIZE;
+		newChunk->used = 0;
+		newChunk->ntuples = 0;
+
+		newChunk->next = hashtable->chunks;
+		hashtable->chunks = newChunk;
+	}
+
+	/* OK, we have enough space in the chunk, let's add the tuple */
+	hashtable->chunks->used += size;
+	hashtable->chunks->ntuples += 1;
+
+	/* allocate pointer to the start of the tuple memory */
+	return hashtable->chunks->data + (hashtable->chunks->used - size);
+}
+
+
 /*
  * ExecHashSkewTableInsert
  *
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..166bb7a 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -102,6 +102,23 @@ typedef struct HashSkewBucket
 #define SKEW_WORK_MEM_PERCENT  2
 #define SKEW_MIN_OUTER_FRACTION  0.01
 
+/* used for dense allocation of tuples */
+#define HASH_CHUNK_SIZE			(32*1024L)
+#define HASH_CHUNK_THRESHOLD	(HASH_CHUNK_SIZE/4)
+
+typedef struct MemoryChunkData
+{
+	int			ntuples;	/* number of tuples stored in this chunk */
+	size_t		maxlen;		/* length of the buffer */
+	size_t		used;		/* number of chunk bytes already used */
+
+	struct MemoryChunkData *next;	/* pointer to the next chunk (linked list) */
+
+	char		data[1];	/* buffer allocated at the end */
+} MemoryChunkData;
+
+typedef struct MemoryChunkData* MemoryChunk;
+
 typedef struct HashJoinTableData
 {
@@ -157,6 +174,9 @@ typedef struct HashJoinTableData
 	MemoryContext hashCxt;		/* context for whole-hash-join storage */
 	MemoryContext batchCxt;		/* context for this-batch-only storage */
+
+	/* used for dense allocation of tuples (into linked chunks) */
+	MemoryChunk chunks;			/* one list for the whole batch */
 } HashJoinTableData;
 
 #endif   /* HASHJOIN_H */
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+				 (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d5acfb1..a76e878 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
 
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
 /* ----------------------------------------------------------------
  *		ExecHash
  *
@@ -117,6 +121,7 @@ MultiExecHash(HashState *node)
 				/* It's a skew tuple, so put it into that hash table */
 				ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 										bucketNumber);
+				hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -127,6 +132,25 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+	{
+		/* We never decrease the number of buckets. */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+		printf("Increasing nbuckets %d => %d\n",
+			   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+		ExecHashIncreaseNumBuckets(hashtable);
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+		hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
+	hashtable->nbuckets_optimal = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
+	hashtable->log2_nbuckets_optimal = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
 	hashtable->skewEnabled = false;
@@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->skewTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -387,7 +415,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			1
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -397,6 +425,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 {
 	int			tupsize;
 	double		inner_rel_bytes;
+	long		buckets_bytes;
 	long		hash_table_bytes;
 	long		skew_table_bytes;
 	long		max_pointers;
@@ -419,6 +448,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	inner_rel_bytes = ntuples * tupsize;
 
 	/*
+	 * Compute memory occupied by buckets, assuming all tuples fit into
+	 * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+	 * are just 'HashJoinTuple' elements (pointers to HashJoinTupleData).
+	 * Also, we never use less than 1024 buckets.
+	 */
+	nbuckets = (1 << my_log2(ntuples / NTUP_PER_BUCKET));
+	nbuckets = Max(1024, nbuckets);
+	buckets_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+	/*
 	 * Target in-memory hashtable size is work_mem kilobytes.
 	 */
 	hash_table_bytes = work_mem * 1024L;
@@ -469,16 +508,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
 
-	if (inner_rel_bytes > hash_table_bytes)
+	if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
 	{
 		/* We'll need multiple batches */
 		long		lbuckets;
 		double		dbatch;
 		int			minbatch;
-
-		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
-		lbuckets = Min(lbuckets, max_pointers);
-		nbuckets = (int) lbuckets;
+		long		bucket_size;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
 		dbatch = Min(dbatch, max_pointers);
@@ -486,6 +522,54 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 		nbatch = 2;
 		while (nbatch < minbatch)
 			nbatch <<= 1;
+
+		/* memory needed by a single "full" bucket (including tuples) */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+
+		/*
+		 * When batching, we use buckets size for full work_mem. We simply
+		 * divide work_mem by memory needed per 'full bucket' (a pointer and
+		 * NTUP_PER_BUCKET tuples, each 'tupsize' bytes, including additional
+		 * overhead for hash, pointer to the next tuple etc.).
+		 */
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+
+		/* protect against nbucket overflow */
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
+
+		/* Compute memory needed for buckets. */
+		buckets_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize is
+		 * always >= 24B, plus actual data. So buckets should never really
+		 * exceed 25% of work_mem (even for NTUP_PER_BUCKET=1). Except maybe
+		 * for work_mem values that are not 2^N bytes, where we might get more
+		 * because of doubling. So let's look for 50% here.
+		 */
+		Assert(buckets_bytes <= hash_table_bytes/2);
+
+		/*
+		 * Subtract the buckets from work_mem, so we know how much memory
+		 * remains for the actual tuples.
+		 */
+		hash_table_bytes -= buckets_bytes;
+
+		/*
+		 * Increase the nbatch until we get both tuples and buckets into work_mem.
+		 *
+		 * The loop should not execute more than once in most cases, becase tuples are
+		 * usually much wider than buckets (usually 8B pointers), so by using only
+		 * (batch_bytes/2) should get us below work_mem.
+		 *
+		 * The worst case is that (nbuckets == 2*ntuples-1), giving us about twice the
+		 * number of buckets, i.e. about 2*sizeof(void*) per tuple. But that's
+		 * the consequence of how NTUP_PER_BUCKET is chosen, and work_mem limit.
+		 */
+		while ((inner_rel_bytes / nbatch) > hash_table_bytes)
+			nbatch <<= 1;
+
 	}
 	else
 	{
@@ -620,7 +704,23 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * and copy/insert the tuples from scratch. So set all buckets to NULL.
 	 * And throw away the old chunks - we have a pointer in oldchunks, and
 	 * we'll free them once we've copied all the tuples.
+	 *
+	 * If we need to do a resize of buckets, we can do it while rebatching.
 	 */
+
+	if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+	{
+		/* we never decrease the number of buckets */
+		Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+		hashtable->nbuckets = hashtable->nbuckets_optimal;
+		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+		hashtable->buckets = repalloc(hashtable->buckets,
+								sizeof(HashJoinTuple) * hashtable->nbuckets);
+	}
+
+	/* zero the buckets, start allocating into new chunks */
 	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 	hashtable->chunks = NULL;
 
@@ -705,6 +805,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	MemoryChunk chunk;
+
+	/* do nothing if not an increase (it's called increase for a reason) */
+	if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+		return;
+
+	/*
+	 * We already know the optimal number of buckets, so let's just
+	 * compute the log2_nbuckets for it.
+	 */
+	hashtable->nbuckets = hashtable->nbuckets_optimal;
+	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets <= (INT_MAX/2));
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/*
+	 * Just reallocate the proper number of buckets - we don't need to
+	 * walk through them - we can walk the dense-allocated chunks
+	 * (just like in ExecHashIncreaseNumBatches, but without all the
+	 * copying into new chunks)
+	 */
+	hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+								hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+	/* scan through all tuples in all chunks, rebuild the hash table */
+	chunk = hashtable->chunks;
+	while (chunk != NULL)
+	{
+
+		/* position within the buffer (up to chunk->used) */
+		size_t idx = 0;
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < chunk->used)
+		{
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int bucketno;
+			int batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* just add the tuple to the proper bucket */
+			hashTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = hashTuple;
+
+			/* bytes occupied in memory HJ tuple overhead + actual tuple length */
+			idx += MAXALIGN(hashTupleSize);
+
+		}
+
+		/* proceed to the next chunk */
+		chunk = chunk->next;
+
+	}
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, batchTuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -737,6 +924,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		 */
 		HashJoinTuple hashTuple;
 		int			hashTupleSize;
+		double		ntuples = (hashtable->totalTuples - hashtable->skewTuples);
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
@@ -757,12 +945,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/*
+		 * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+		 * but only when there's still a single batch at this moment.
+		 */
+		if ((hashtable->nbatch == 1) &&
+			(hashtable->nbuckets_optimal <= INT_MAX/2) &&	/* overflow protection */
+			(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		{
+			hashtable->nbuckets_optimal *= 2;
+			hashtable->log2_nbuckets_optimal += 1;
+		}
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
-		if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+		/*
+		 * Do we need to increase number of batches? Account for space required
+		 * by buckets (optimal number).
+		 */
+		if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
 	}
 	else
 	{
@@ -885,7 +1091,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
  * functions are good about randomizing all their output bits, else we are
  * likely to have very skewed bucket or batch occupancy.)
  *
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth.  Once we start batching, the value is fixed and does not
+ * change over the course of join (making it possible to compute batch number
+ * the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it. This
  * effectively adds one more bit to the top of the batchno.
@@ -1351,7 +1560,7 @@ dense_alloc(HashJoinTable hashtable, int size)
 	/* just in case the size is not already aligned properly */
 	size = MAXALIGN(size);
 
-	/* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+	/* if tuple size is larger than of 1/4 of chunk size, allocate a separate chunk */
 	if (size > HASH_CHUNK_THRESHOLD) {
 
 		/* allocate new chunk and put it at the beginning of the list */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 166bb7a..f0775ea 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -125,6 +125,10 @@ typedef struct HashJoinTableData
 	int			nbuckets;		/* # buckets in the in-memory hash table */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
+	int			nbuckets_original;	/* # buckets when starting the first hash */
+	int			nbuckets_optimal;	/* optimal # buckets (per batch) */
+	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
@@ -146,6 +150,7 @@ typedef struct HashJoinTableData
 
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	double		skewTuples;		/* # tuples inserted into skew tuples */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers