On 10.7.2014 21:33, Tomas Vondra wrote:
> On 9.7.2014 16:07, Robert Haas wrote:
>> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>>> Thinking about this a bit more, do we really need to build the hash
>>> table on the first pass? Why not to do this:
>>> (1) batching
>>>     - read the tuples, stuff them into a simple list
>>>     - don't build the hash table yet
>>> (2) building the hash table
>>>     - we have all the tuples in a simple list, batching is done
>>>     - we know exact row count, can size the table properly
>>>     - build the table
>> We could do this, and in fact we could save quite a bit of memory if
>> we allocated say 1MB chunks and packed the tuples in tightly instead
>> of palloc-ing each one separately.  But I worry that rescanning the
>> data to build the hash table would slow things down too much.
> I did a quick test of how much memory we could save by this. The 
> attached patch densely packs the tuples into 32kB chunks (1MB seems
> way too much because of small work_mem values, but I guess this might
> be tuned based on number of tuples / work_mem size ...).

Turns out getting this working properly will quite complicated. The
patch was only a quick attempt to see how much overhead there is, and
didn't solve one important details - batching.

The problem is that when increasing the number of batches, we need to
get rid of the tuples from one of them. Which means calling pfree() on
the tuples written to a temporary file, and that's not possible with the
dense allocation.

1) copy into new chunks (dead end)

The first idea I had was to just "copy" everything into new chunks and
then throw away the old ones, but this way we might end up using 150% of
work_mem on average (when the two batches are about 1/2 the data each),
and in an extreme case up to 200% of work_mem (all tuples having the
same key and thus falling into the same batch). So I don't think that's
really a good approach.

2) walking through the tuples sequentially

The other option is not to walk the tuples through buckets, but by
walking throught the chunks - we know the tuples are stored as
HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
walking the tuples in the order they're stored, we may just rearrange
the tuples in already allocated memory. And maybe it could be even
faster than the current code, because of the sequential access to the
memory (as opposed to the random access through buckets) and CPU caches.
So I like this approach - it's simple, clean and shouldn't add any
overhead (memory or CPU).

3) batch-aware chunks

I also think a "batch-aware" chunks might work. Say we're starting with
nbatch=N. Instead of allocating everything in a single chunk, we'll
allocate the tuples from the chunks according to a "hypothetical batch
number" - what batch would the tuple belong to if we had (nbatch=N*2).
So we'd have two chunks (or sequences of chunks), and we'd allocate the
tuples into them.

Then if we actually need to increase the number of batches, we know we
can just free one of the chunks, because all of the tuples are from the
'wrong' batche, and redistribute the remaining tuples into the new
chunks (according to the new hypothetical batch numbers).

I'm not sure how much overhead this would be, though. I think it can be
done quite efficiently, though, and it shouldn't have any impact at all,
if we don't do any additional batching (i.e. if the initial estimates
are ok).

Any other ideas how to tackle this?


Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to