On 9.9.2014 16:09, Robert Haas wrote:
> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <[email protected]> wrote:
>> So I only posted the separate patch for those who want to do a review,
>> and then a "complete patch" with both parts combined. But it sure may be
>> a bit confusing.
>
> Let's do this: post each new version of the patches only on this
> thread, as a series of patches that can be applied one after another.
OK, attached. Apply in this order:
1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch
>>> In ExecChooseHashTableSize(), if buckets_bytes is independent of
>>> nbatch, could we just compute it before working out dbatch, and just
>>> deduct it from hash_table_bytes? That seems like it'd do the same
>>> thing for less work.
>>
>> Well, we can do that. But I really don't think that's going to make
>> measurable difference. And I think the code is more clear this way.
>
> Really? It seems to me that you're doing more or less the same
> calculation that's already been done a second time. It took me 20
> minutes of staring to figure out what it was really doing. Now
> admittedly I was a little low on caffeine, but...
I reworked this part a bit, to make it easier to understand. Mostly
following your recommendations.
>
>>> Either way, what happens if buckets_bytes is already bigger than
>>> hash_table_bytes?
>>
>> Hm, I don't see how that could happen.
>
> How about an Assert() and a comment, then?
Done. According to the comment, we should never really get over 25%,
except maybe for strange work_mem values, because we're keeping nbuckets
as 2^N. But even then we should not reach 50%, so I added this assert:
Assert(buckets_bytes <= hash_table_bytes/2);
And then we subtract the buckets_bytes, and continue with the loop.
>
>>> A few more stylistic issues that I see:
>>>
>>> + if ((hashtable->nbatch == 1) &&
>>> + if (hashtable->nbuckets_optimal <= (INT_MAX/2))
>>> + if (size > (HASH_CHUNK_SIZE/8))
>>>
>>> While I'm all in favor of using parentheses generously where it's
>>> needed to add clarity, these and similar usages seem over the top to
>>> me.
>>
>> It seemed more readable for me at the time of coding it, and it still
>> seems better this way, but ...
>>
>> https://www.youtube.com/watch?v=CxK_nA2iVXw
>
> Heh. Well, at any rate, I think PostgreSQL style is to not use
> parentheses in that situation.
FWIW, I added HASH_CHUNK_THRESHOLD macro, specifying the threshold. I
also simplified the conditions a bit, and removed some of the parentheses.
>
>>> +char * chunk_alloc(HashJoinTable hashtable, int size) {
>>>
>>> Eh... no.
>>>
>>> + /* XXX maybe we should use MAXALIGN(size) here ... */
>>>
>>> +1.
>>
>> I'm not sure what the 'no' is pointing at? Maybe that the parenthesis
>> should be on the next line? And is the +1 about doing MAXALIGN?
>
> The "no" is about the fact that you have the return type, the function
> name, and the opening brace on one line instead of three lines as is
> project style. The +1 is for applying MAXALIGN to the size.
OK, fixed.
I also did a few 'minor' changes to the dense allocation patch, most
notably:
* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.
* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')
Seems like a better fit to what the function does.
* the chunk size is 32kB (instead of 16kB), and we're using a 1/4
threshold for 'oversized' items
We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands refused to type it ...). I'll look into that.
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..d5acfb1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
int bucketNumber);
static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
+static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
/* ----------------------------------------------------------------
* ExecHash
@@ -294,6 +295,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
hashtable->spaceAllowedSkew =
hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
+ hashtable->chunks = NULL;
+
/*
* Get info about the hash functions to be used for each hash key. Also
* remember whether the join operators are strict.
@@ -556,10 +559,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
int oldnbatch = hashtable->nbatch;
int curbatch = hashtable->curbatch;
int nbatch;
- int i;
MemoryContext oldcxt;
long ninmemory;
long nfreed;
+ MemoryChunk oldchunks = hashtable->chunks;
/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
@@ -612,51 +615,71 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
*/
ninmemory = nfreed = 0;
- for (i = 0; i < hashtable->nbuckets; i++)
+ /*
+ * We will scan through the chunks directly, so we can reset the buckets
+ * and copy/insert the tuples from scratch. So set all buckets to NULL.
+ * And throw away the old chunks - we have a pointer in oldchunks, and
+ * we'll free them once we've copied all the tuples.
+ */
+ memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+ hashtable->chunks = NULL;
+
+ /* so, let's scan through the old chunks, and all tuples in each chunk */
+ while (oldchunks != NULL)
{
- HashJoinTuple prevtuple;
- HashJoinTuple tuple;
- prevtuple = NULL;
- tuple = hashtable->buckets[i];
+ /* pointer to the next memory chunk (may be NULL for the last chunk) */
+ MemoryChunk nextchunk = oldchunks->next;
+
+ /* position within the buffer (up to oldchunks->used) */
+ size_t idx = 0;
- while (tuple != NULL)
+ /* process all tuples stored in this chunk (and then free it) */
+ while (idx < oldchunks->used)
{
- /* save link in case we delete */
- HashJoinTuple nexttuple = tuple->next;
- int bucketno;
- int batchno;
+
+ /* get the hashjoin tuple and mintuple stored right after it */
+ HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+ MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+ int hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+ int bucketno;
+ int batchno;
ninmemory++;
- ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+ ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
&bucketno, &batchno);
- Assert(bucketno == i);
+
if (batchno == curbatch)
{
- /* keep tuple */
- prevtuple = tuple;
+ /* keep tuple - this means we need to copy it into the new chunks */
+ HashJoinTuple copyTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
+ memcpy(copyTuple, hashTuple, hashTupleSize);
+
+ /* of course, add it to the appropriate bucket */
+ copyTuple->next = hashtable->buckets[bucketno];
+ hashtable->buckets[bucketno] = copyTuple;
}
else
{
/* dump it out */
Assert(batchno > curbatch);
- ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(tuple),
- tuple->hashvalue,
+ ExecHashJoinSaveTuple(HJTUPLE_MINTUPLE(hashTuple),
+ hashTuple->hashvalue,
&hashtable->innerBatchFile[batchno]);
- /* and remove from hash table */
- if (prevtuple)
- prevtuple->next = nexttuple;
- else
- hashtable->buckets[i] = nexttuple;
- /* prevtuple doesn't change */
- hashtable->spaceUsed -=
- HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
- pfree(tuple);
+
+ hashtable->spaceUsed -= hashTupleSize;
nfreed++;
}
- tuple = nexttuple;
+ /* bytes occupied in memory: HJ tuple overhead + actual tuple length */
+ idx += MAXALIGN(hashTupleSize);
}
+
+ /* so, we're done with this chunk - free it and proceed to the next one */
+ pfree(oldchunks);
+ oldchunks = nextchunk;
}
#ifdef HJDEBUG
@@ -717,8 +740,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
/* Create the HashJoinTuple */
hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
- hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
- hashTupleSize);
+ hashTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
+
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
@@ -1068,6 +1091,9 @@ ExecHashTableReset(HashJoinTable hashtable)
hashtable->spaceUsed = 0;
MemoryContextSwitchTo(oldcxt);
+
+ /* Forget the chunks (the memory was freed by the context reset above). */
+ hashtable->chunks = NULL;
}
/*
@@ -1318,6 +1344,69 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
return INVALID_SKEW_BUCKET_NO;
}
+static char *
+dense_alloc(HashJoinTable hashtable, int size)
+{
+
+ /* just in case the size is not already aligned properly */
+ size = MAXALIGN(size);
+
+ /* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+ if (size > HASH_CHUNK_THRESHOLD) {
+
+ /* allocate new chunk and put it at the beginning of the list */
+ MemoryChunk newChunk
+ = (MemoryChunk)MemoryContextAlloc(hashtable->batchCxt,
+ offsetof(MemoryChunkData, data) + size);
+
+ newChunk->maxlen = size;
+ newChunk->used = 0;
+ newChunk->ntuples = 0;
+
+ /* If there already is a chunk, add it after it so we can still use the space in it */
+ if (hashtable->chunks != NULL) {
+ newChunk->next = hashtable->chunks->next;
+ hashtable->chunks->next = newChunk;
+ } else {
+ newChunk->next = hashtable->chunks;
+ hashtable->chunks = newChunk;
+ }
+
+ newChunk->used += size;
+ newChunk->ntuples += 1;
+
+ return newChunk->data;
+
+ }
+
+ /*
+ * If it's within the size limit, see if we have enough space for it in the
+ * current chunk (if there's one). If not, allocate a fresh chunk.
+ */
+ if ((hashtable->chunks == NULL) || (hashtable->chunks->maxlen - hashtable->chunks->used) < size) {
+
+ /* allocate new chunk and put it at the beginning of the list */
+ MemoryChunk newChunk
+ = (MemoryChunk)MemoryContextAlloc(hashtable->batchCxt,
+ offsetof(MemoryChunkData, data) + HASH_CHUNK_SIZE);
+
+ newChunk->maxlen = HASH_CHUNK_SIZE;
+ newChunk->used = 0;
+ newChunk->ntuples = 0;
+
+ newChunk->next = hashtable->chunks;
+ hashtable->chunks = newChunk;
+ }
+
+ /* OK, we have enough space in the chunk, let's add the tuple */
+ hashtable->chunks->used += size;
+ hashtable->chunks->ntuples += 1;
+
+ /* allocate pointer to the start of the tuple memory */
+ return hashtable->chunks->data + (hashtable->chunks->used - size);
+}
+
+
/*
* ExecHashSkewTableInsert
*
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..166bb7a 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -102,6 +102,23 @@ typedef struct HashSkewBucket
#define SKEW_WORK_MEM_PERCENT 2
#define SKEW_MIN_OUTER_FRACTION 0.01
+/* used for dense allocation of tuples */
+#define HASH_CHUNK_SIZE (32*1024L)
+#define HASH_CHUNK_THRESHOLD (HASH_CHUNK_SIZE/4)
+
+typedef struct MemoryChunkData
+{
+ int ntuples; /* number of tuples stored in this chunk */
+ size_t maxlen; /* length of the buffer */
+ size_t used; /* number of chunk bytes already used */
+
+ struct MemoryChunkData *next; /* pointer to the next chunk (linked list) */
+
+ char data[1]; /* buffer allocated at the end */
+} MemoryChunkData;
+
+typedef struct MemoryChunkData* MemoryChunk;
+
typedef struct HashJoinTableData
{
@@ -157,6 +174,9 @@ typedef struct HashJoinTableData
MemoryContext hashCxt; /* context for whole-hash-join storage */
MemoryContext batchCxt; /* context for this-batch-only storage */
+
+ /* used for dense allocation of tuples (into linked chunks) */
+ MemoryChunk chunks; /* one list for the whole batch */
} HashJoinTableData;
#endif /* HASHJOIN_H */
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..d128e08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,21 @@ show_hash_info(HashState *hashstate, ExplainState *es)
if (es->format != EXPLAIN_FORMAT_TEXT)
{
ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+ ExplainPropertyLong("Original Hash Buckets",
+ hashtable->nbuckets_original, es);
ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
ExplainPropertyLong("Original Hash Batches",
hashtable->nbatch_original, es);
ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
}
- else if (hashtable->nbatch_original != hashtable->nbatch)
+ else if ((hashtable->nbatch_original != hashtable->nbatch) ||
+ (hashtable->nbuckets_original != hashtable->nbuckets))
{
appendStringInfoSpaces(es->str, es->indent * 2);
appendStringInfo(es->str,
- "Buckets: %d Batches: %d (originally %d) Memory Usage: %ldkB\n",
- hashtable->nbuckets, hashtable->nbatch,
- hashtable->nbatch_original, spacePeakKb);
+ "Buckets: %d (originally %d) Batches: %d (originally %d) Memory Usage: %ldkB\n",
+ hashtable->nbuckets, hashtable->nbuckets_original,
+ hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
}
else
{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d5acfb1..a76e878 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
int mcvsToUse);
static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -49,6 +50,9 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
static char * dense_alloc(HashJoinTable hashtable, int tupleSize);
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab) ((htab)->nbuckets_optimal * sizeof(HashJoinTuple))
+
/* ----------------------------------------------------------------
* ExecHash
*
@@ -117,6 +121,7 @@ MultiExecHash(HashState *node)
/* It's a skew tuple, so put it into that hash table */
ExecHashSkewTableInsert(hashtable, slot, hashvalue,
bucketNumber);
+ hashtable->skewTuples += 1;
}
else
{
@@ -127,6 +132,25 @@ MultiExecHash(HashState *node)
}
}
+ /* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+ if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+ {
+ /* We never decrease the number of buckets. */
+ Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+#ifdef HJDEBUG
+ printf("Increasing nbuckets %d => %d\n",
+ hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+
+ ExecHashIncreaseNumBuckets(hashtable);
+ }
+
+ /* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+ hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+ if (hashtable->spaceUsed > hashtable->spacePeak)
+ hashtable->spacePeak = hashtable->spaceUsed;
+
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, hashtable->totalTuples);
@@ -272,7 +296,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
*/
hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
hashtable->nbuckets = nbuckets;
+ hashtable->nbuckets_original = nbuckets;
+ hashtable->nbuckets_optimal = nbuckets;
hashtable->log2_nbuckets = log2_nbuckets;
+ hashtable->log2_nbuckets_optimal = log2_nbuckets;
hashtable->buckets = NULL;
hashtable->keepNulls = keepNulls;
hashtable->skewEnabled = false;
@@ -286,6 +313,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
hashtable->nbatch_outstart = nbatch;
hashtable->growEnabled = true;
hashtable->totalTuples = 0;
+ hashtable->skewTuples = 0;
hashtable->innerBatchFile = NULL;
hashtable->outerBatchFile = NULL;
hashtable->spaceUsed = 0;
@@ -387,7 +415,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
*/
/* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET 10
+#define NTUP_PER_BUCKET 1
void
ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
@@ -397,6 +425,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
{
int tupsize;
double inner_rel_bytes;
+ long buckets_bytes;
long hash_table_bytes;
long skew_table_bytes;
long max_pointers;
@@ -419,6 +448,16 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
inner_rel_bytes = ntuples * tupsize;
/*
+ * Compute memory occupied by buckets, assuming all tuples fit into
+ * a single batch (consider NTUP_PER_BUCKET tuples per bucket) - buckets
+ * are just 'HashJoinTuple' elements (pointers to HashJoinTupleData).
+ * Also, we never use less than 1024 buckets.
+ */
+ nbuckets = (1 << my_log2(ntuples / NTUP_PER_BUCKET));
+ nbuckets = Max(1024, nbuckets);
+ buckets_bytes = sizeof(HashJoinTuple) * nbuckets;
+
+ /*
* Target in-memory hashtable size is work_mem kilobytes.
*/
hash_table_bytes = work_mem * 1024L;
@@ -469,16 +508,13 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
/* also ensure we avoid integer overflow in nbatch and nbuckets */
max_pointers = Min(max_pointers, INT_MAX / 2);
- if (inner_rel_bytes > hash_table_bytes)
+ if ((inner_rel_bytes + buckets_bytes) > hash_table_bytes)
{
/* We'll need multiple batches */
long lbuckets;
double dbatch;
int minbatch;
-
- lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
- lbuckets = Min(lbuckets, max_pointers);
- nbuckets = (int) lbuckets;
+ long bucket_size;
dbatch = ceil(inner_rel_bytes / hash_table_bytes);
dbatch = Min(dbatch, max_pointers);
@@ -486,6 +522,54 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
nbatch = 2;
while (nbatch < minbatch)
nbatch <<= 1;
+
+ /* memory needed by a single "full" bucket (including tuples) */
+ bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+
+ /*
+ * When batching, we size the buckets for the full work_mem: simply
+ * divide work_mem by the memory needed per 'full bucket' (a bucket
+ * pointer plus NTUP_PER_BUCKET tuples, each 'tupsize' bytes, including
+ * the overhead for the hash value, next-tuple pointer etc.).
+ */
+ lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+
+ /* protect against nbucket overflow */
+ lbuckets = Min(lbuckets, max_pointers);
+ nbuckets = (int) lbuckets;
+
+ /* Compute memory needed for buckets. */
+ buckets_bytes = nbuckets * sizeof(HashJoinTuple);
+
+ /*
+ * Buckets are simple pointers to hashjoin tuples, while tupsize is
+ * always >= 24B, plus actual data. So buckets should never really
+ * exceed 25% of work_mem (even for NTUP_PER_BUCKET=1). Except maybe
+ * for work_mem values that are not 2^N bytes, where we might get more
+ * because of doubling. So let's look for 50% here.
+ */
+ Assert(buckets_bytes <= hash_table_bytes/2);
+
+ /*
+ * Subtract the buckets from work_mem, so we know how much memory
+ * remains for the actual tuples.
+ */
+ hash_table_bytes -= buckets_bytes;
+
+ /*
+ * Increase the nbatch until we get both tuples and buckets into work_mem.
+ *
+ * The loop should rarely execute more than once, because tuples are
+ * usually much wider than bucket pointers (typically 8B), so doubling
+ * nbatch once should get the per-batch tuple bytes below work_mem.
+ *
+ * The worst case is (nbuckets == 2*ntuples - 1), giving us about twice
+ * the number of buckets, i.e. about 2*sizeof(void*) per tuple. But that
+ * is a consequence of how NTUP_PER_BUCKET is chosen, and of the
+ * work_mem limit.
+ */
+ while ((inner_rel_bytes / nbatch) > hash_table_bytes)
+ nbatch <<= 1;
+
}
else
{
@@ -620,7 +704,23 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
* and copy/insert the tuples from scratch. So set all buckets to NULL.
* And throw away the old chunks - we have a pointer in oldchunks, and
* we'll free them once we've copied all the tuples.
+ *
+ * If we need to do a resize of buckets, we can do it while rebatching.
*/
+
+ if (hashtable->nbuckets_optimal != hashtable->nbuckets)
+ {
+ /* we never decrease the number of buckets */
+ Assert(hashtable->nbuckets_optimal > hashtable->nbuckets);
+
+ hashtable->nbuckets = hashtable->nbuckets_optimal;
+ hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
+
+ hashtable->buckets = repalloc(hashtable->buckets,
+ sizeof(HashJoinTuple) * hashtable->nbuckets);
+ }
+
+ /* zero the buckets, start allocating into new chunks */
memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
hashtable->chunks = NULL;
@@ -705,6 +805,93 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
}
/*
+ * ExecHashIncreaseNumBuckets
+ * increase the original number of buckets in order to reduce
+ * number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+ MemoryChunk chunk;
+
+ /* do nothing if not an increase (it's called increase for a reason) */
+ if (hashtable->nbuckets >= hashtable->nbuckets_optimal)
+ return;
+
+ /*
+ * We already know the optimal number of buckets, so let's just
+ * compute the log2_nbuckets for it.
+ */
+ hashtable->nbuckets = hashtable->nbuckets_optimal;
+ hashtable->log2_nbuckets = my_log2(hashtable->nbuckets_optimal);
+
+ Assert(hashtable->nbuckets > 1);
+ Assert(hashtable->nbuckets <= (INT_MAX/2));
+ Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+ printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+ /*
+ * Just reallocate the proper number of buckets - we don't need to
+ * walk through them - we can walk the dense-allocated chunks
+ * (just like in ExecHashIncreaseNumBatches, but without all the
+ * copying into new chunks)
+ */
+
+ hashtable->buckets = (HashJoinTuple *) repalloc(hashtable->buckets,
+ hashtable->nbuckets * sizeof(HashJoinTuple));
+
+ memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
+
+ /* scan through all tuples in all chunks, rebuild the hash table */
+ chunk = hashtable->chunks;
+ while (chunk != NULL)
+ {
+
+ /* position within the buffer (up to chunk->used) */
+ size_t idx = 0;
+
+ /* process all tuples stored in this chunk (and then free it) */
+ while (idx < chunk->used)
+ {
+
+ /* get the hashjoin tuple and mintuple stored right after it */
+ HashJoinTuple hashTuple = (HashJoinTuple)(chunk->data + idx);
+ MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+ int hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+ int bucketno;
+ int batchno;
+
+ ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
+ &bucketno, &batchno);
+
+ /* just add the tuple to the proper bucket */
+ hashTuple->next = hashtable->buckets[bucketno];
+ hashtable->buckets[bucketno] = hashTuple;
+
+ /* bytes occupied in memory: HJ tuple overhead + actual tuple length */
+ idx += MAXALIGN(hashTupleSize);
+
+ }
+
+ /* proceed to the next chunk */
+ chunk = chunk->next;
+
+ }
+
+#ifdef HJDEBUG
+ printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+ hashtable->nbuckets,
+ (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
* ExecHashTableInsert
* insert a tuple into the hash table depending on the hash value
* it may just go to a temp file for later batches
@@ -737,6 +924,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
*/
HashJoinTuple hashTuple;
int hashTupleSize;
+ double ntuples = (hashtable->totalTuples - hashtable->skewTuples);
/* Create the HashJoinTuple */
hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
@@ -757,12 +945,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
hashTuple->next = hashtable->buckets[bucketno];
hashtable->buckets[bucketno] = hashTuple;
+ /*
+ * Increase the (optimal) number of buckets if we just exceeded NTUP_PER_BUCKET,
+ * but only when there's still a single batch at this moment.
+ */
+ if ((hashtable->nbatch == 1) &&
+ (hashtable->nbuckets_optimal <= INT_MAX/2) && /* overflow protection */
+ (ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+ {
+ hashtable->nbuckets_optimal *= 2;
+ hashtable->log2_nbuckets_optimal += 1;
+ }
+
/* Account for space used, and back off if we've used too much */
hashtable->spaceUsed += hashTupleSize;
if (hashtable->spaceUsed > hashtable->spacePeak)
hashtable->spacePeak = hashtable->spaceUsed;
- if (hashtable->spaceUsed > hashtable->spaceAllowed)
+
+ /*
+ * Do we need to increase number of batches? Account for space required
+ * by buckets (optimal number).
+ */
+ if (hashtable->spaceUsed + BUCKETS_SPACE(hashtable) > hashtable->spaceAllowed)
ExecHashIncreaseNumBatches(hashtable);
+
}
else
{
@@ -885,7 +1091,10 @@ ExecHashGetHashValue(HashJoinTable hashtable,
* functions are good about randomizing all their output bits, else we are
* likely to have very skewed bucket or batch occupancy.)
*
- * nbuckets doesn't change over the course of the join.
+ * The nbuckets/log2_nbuckets may change while (nbatch==1) because of dynamic
+ * buckets growth. Once we start batching, the value is fixed and does not
+ * change over the course of the join (making it possible to compute the batch number
+ * the way we do here).
*
* nbatch is always a power of 2; we increase it only by doubling it. This
* effectively adds one more bit to the top of the batchno.
@@ -1351,7 +1560,7 @@ dense_alloc(HashJoinTable hashtable, int size)
/* just in case the size is not already aligned properly */
size = MAXALIGN(size);
- /* if tuple size is larger than of 1/8 of chunk size, allocate a separate chunk */
+ /* if tuple size is larger than of 1/4 of chunk size, allocate a separate chunk */
if (size > HASH_CHUNK_THRESHOLD) {
/* allocate new chunk and put it at the beginning of the list */
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 166bb7a..f0775ea 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -125,6 +125,10 @@ typedef struct HashJoinTableData
int nbuckets; /* # buckets in the in-memory hash table */
int log2_nbuckets; /* its log2 (nbuckets must be a power of 2) */
+ int nbuckets_original; /* # buckets when starting the first hash */
+ int nbuckets_optimal; /* optimal # buckets (per batch) */
+ int log2_nbuckets_optimal; /* log2 of nbuckets_optimal */
+
/* buckets[i] is head of list of tuples in i'th in-memory bucket */
struct HashJoinTupleData **buckets;
/* buckets array is per-batch storage, as are all the tuples */
@@ -146,6 +150,7 @@ typedef struct HashJoinTableData
bool growEnabled; /* flag to shut off nbatch increases */
double totalTuples; /* # tuples obtained from inner plan */
+ double skewTuples; /* # tuples inserted into skew buckets */
/*
* These arrays are allocated for the life of the hash join, but only if
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers