On 10.7.2014 21:33, Tomas Vondra wrote:
> On 9.7.2014 16:07, Robert Haas wrote:
>> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
>>> Thinking about this a bit more, do we really need to build the hash
>>> table on the first pass? Why not to do this:
>>>
>>> (1) batching
>>> - read the tuples, stuff them into a simple list
>>> - don't build the hash table yet
>>>
>>> (2) building the hash table
>>> - we have all the tuples in a simple list, batching is done
>>> - we know exact row count, can size the table properly
>>> - build the table
>>
>> We could do this, and in fact we could save quite a bit of memory if
>> we allocated say 1MB chunks and packed the tuples in tightly instead
>> of palloc-ing each one separately. But I worry that rescanning the
>> data to build the hash table would slow things down too much.
>
> I did a quick test of how much memory we could save by this. The
> attached patch densely packs the tuples into 32kB chunks (1MB seems
> way too much because of small work_mem values, but I guess this might
> be tuned based on number of tuples / work_mem size ...).
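For anyone following along, the dense packing being discussed boils down to a bump allocator: hand out tuple-sized slices from a 32kB chunk and start a new chunk when the current one is full. A minimal sketch (plain malloc instead of the backend's memory contexts, and all names are illustrative, not the patch's actual API):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk size, matching the 32kB used in the quick patch. */
#define CHUNK_SIZE (32 * 1024)

typedef struct MemoryChunk
{
    struct MemoryChunk *next;   /* older chunks, kept so they can be freed */
    size_t      used;           /* bytes already handed out from data[] */
    char        data[CHUNK_SIZE];
} MemoryChunk;

/*
 * Allocate space for one tuple from the current chunk, starting a new
 * chunk when the current one can't fit the request.  This replaces a
 * per-tuple palloc(), trading individual pfree() for freeing whole
 * chunks - which is exactly the batching problem described below.
 */
static void *
chunk_alloc(MemoryChunk **head, size_t size)
{
    MemoryChunk *chunk = *head;

    /* round up to 8-byte alignment, as palloc would */
    size = (size + 7) & ~(size_t) 7;

    if (chunk == NULL || chunk->used + size > CHUNK_SIZE)
    {
        chunk = malloc(sizeof(MemoryChunk));
        chunk->next = *head;
        chunk->used = 0;
        *head = chunk;
    }

    chunk->used += size;
    return chunk->data + chunk->used - size;
}
```

The memory saving comes from dropping the per-palloc chunk header and power-of-two rounding; the cost is that individual tuples can no longer be pfree()'d, only whole chunks.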
Turns out getting this working properly will be quite complicated. The
patch was only a quick attempt to see how much overhead there is, and
didn't solve one important detail - batching.

The problem is that when increasing the number of batches, we need to
get rid of the tuples from one of them. Which means calling pfree() on
the tuples written to a temporary file, and that's not possible with
the dense allocation.

1) copy into new chunks (dead end)
----------------------------------

The first idea I had was to just "copy" everything into new chunks and
then throw away the old ones, but this way we might end up using 150%
of work_mem on average (when the two batches hold about 1/2 the data
each), and in an extreme case up to 200% of work_mem (all tuples
having the same key and thus falling into the same batch). So I don't
think that's really a good approach.

2) walking through the tuples sequentially
------------------------------------------

The other option is not to walk the tuples through the buckets, but to
walk through the chunks - we know the tuples are stored as
HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
walking the tuples in the order they're stored, we may just rearrange
them in the already allocated memory. And maybe it could be even
faster than the current code, because of the sequential access to the
memory (as opposed to the random access through buckets) and CPU
caches. So I like this approach - it's simple, clean and shouldn't add
any overhead (memory or CPU).

3) batch-aware chunks
---------------------

I also think "batch-aware" chunks might work. Say we're starting with
nbatch=N. Instead of allocating everything in a single chunk, we'd
allocate the tuples into chunks according to a "hypothetical batch
number" - the batch the tuple would belong to if we had nbatch=N*2. So
we'd have two chunks (or sequences of chunks), and we'd allocate the
tuples into them.
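Assuming nbatch stays a power of two (as it does in nodeHash.c) and batch numbers come from the low-order hash bits, deciding which of the two chunk sequences a tuple goes into is a single bit test: the tuple's batch under nbatch=N*2 differs from its batch under nbatch=N exactly when the next hash bit is set. A sketch of that routing, with illustrative names:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Which of the two chunk sequences should this tuple go into?
 *
 * With power-of-two nbatch, batchno = hashvalue % nbatch, so
 *   batch under nbatch     = low log2(nbatch) bits of the hash
 *   batch under 2 * nbatch = one more bit
 * The tuple would move to a different batch after doubling iff that
 * extra bit (hashvalue & nbatch) is set.  Routing allocations by this
 * bit up front means the "would move" chunk sequence can simply be
 * freed wholesale when the doubling actually happens.
 */
static int
hypothetical_side(uint32_t hashvalue, int nbatch)
{
    return (hashvalue & (uint32_t) nbatch) ? 1 : 0;
}
```

So the per-tuple overhead of this scheme is essentially one AND and a branch at allocation time, which seems cheap compared to the copy-everything approach in (1).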
Then if we actually need to increase the number of batches, we know we
can just free one of the chunk sequences, because all of its tuples
belong to the 'wrong' batch, and redistribute the remaining tuples
into new chunks (according to the new hypothetical batch numbers).

I'm not sure how much overhead this would add. I think it can be done
quite efficiently, and it shouldn't have any impact at all if we don't
do any additional batching (i.e. if the initial estimates are ok).

Any other ideas how to tackle this?

regards
Tomas


--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers