On 11 July 2014 10:23, Tomas Vondra <t...@fuzzy.cz> wrote:
> On 11 July 2014, 9:27, Simon Riggs wrote:
>> On 9 July 2014 18:54, Tomas Vondra <t...@fuzzy.cz> wrote:
>>
>>> (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
>>> of batches this requires)
>>
>> If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
>> it save to recalculate the hash bucket at 10 instead?
>> Resizing sounds like it will only be useful if we only just overflow
>> our limit.
>>
>> If we release the next version with this as a hardcoded change, my
>> understanding is that memory usage for hash joins will leap upwards,
>> even if the run time of queries reduces. It sounds like we need some
>> kind of parameter to control this. "We made it faster" might not be
>> true if we run this on servers that are already experiencing high
>> memory pressure.
>
> Sure. We certainly don't want to make things worse for environments
> with memory pressure.
>
> The current implementation has two issues regarding memory:
>
> (1) It does not count the buckets as used memory, i.e. they are not
> included in work_mem (so we may overflow work_mem). I plan to fix this,
> to make work_mem a bit more correct, as it's important for cases with
> NTUP_PER_BUCKET=1.
>
> (2) There's a significant palloc overhead, because of allocating each
> tuple separately - see my message from yesterday, where I observed the
> batch memory context getting 1.4GB of memory for 700MB of tuple data.
> By densely packing the tuples, I got down to ~700MB (i.e. pretty much
> no overhead).
>
> The palloc overhead seems to be 20B per tuple (on x86_64), and
> eliminating it more than compensates for the ~8B per tuple required for
> NTUP_PER_BUCKET=1. And fixing (1) makes it more correct / predictable.
>
> It also improves the issue that the palloc overhead is not counted
> towards work_mem at all (that's why I got a ~1.4GB batch context with
> work_mem=1GB).
>
> So in the end this should give us much lower memory usage for hash
> joins, even if we switch to NTUP_PER_BUCKET=1 (although that's pretty
> much an independent change). Does that seem reasonable?
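
For my own understanding, I read the dense packing as roughly the
following (a standalone sketch only - the names and chunk size are
invented here, not taken from your patch):

#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (32 * 1024)      /* arbitrary size for the sketch */

/*
 * Instead of one palloc per tuple (each allocation paying its own
 * header/rounding overhead), tuples are copied back-to-back into large
 * chunks, and only the chunks are allocated individually.
 */
typedef struct TupleChunk
{
    struct TupleChunk *next;        /* chunks form a simple linked list */
    size_t      used;               /* bytes already handed out */
    char        data[CHUNK_SIZE];   /* densely packed tuple bodies */
} TupleChunk;

/*
 * Copy one tuple into the current chunk, starting a new chunk when the
 * current one is full.
 */
static char *
dense_store_tuple(TupleChunk **chunks, const char *tuple, size_t len)
{
    TupleChunk *chunk = *chunks;

    if (len > CHUNK_SIZE)
        return NULL;                /* oversized tuples ignored in this sketch */

    if (chunk == NULL || chunk->used + len > CHUNK_SIZE)
    {
        chunk = malloc(sizeof(TupleChunk));
        if (chunk == NULL)
            return NULL;
        chunk->next = *chunks;
        chunk->used = 0;
        *chunks = chunk;
    }

    memcpy(chunk->data + chunk->used, tuple, len);
    chunk->used += len;
    return chunk->data + chunk->used - len;
}

If that's the right shape, then the ~20B of per-allocation overhead
disappears and the only per-tuple cost of NTUP_PER_BUCKET=1 is the ~8B
for the larger bucket array, as you say.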
Yes, that alleviates my concern. Thanks.

> Regarding the tunable to control this - I certainly don't want another
> GUC no one really knows how to set right. And I think it's unnecessary
> thanks to the palloc overhead / work_mem accounting fix described
> above.

Agreed, nor does anyone.

> The one thing I'm not sure about is what to do in case of reaching the
> work_mem limit (which should only happen with an underestimated row
> count / row width) - how to decide whether to shrink the hash table or
> increment the number of batches. But this is not exclusive to
> NTUP_PER_BUCKET=1; it may happen with whatever NTUP_PER_BUCKET value
> you choose.
>
> The current code does not support resizing at all, so it always
> increments the number of batches, but I feel an "interleaved" approach
> might be more appropriate (nbuckets/2, nbatches*2, nbuckets/2,
> nbatches*2, ...). It'd be nice to have some cost estimates ('how
> expensive is a rescan' vs. 'how expensive is a resize'), but I'm not
> sure how to get that.
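
I don't have a good answer on estimating rescan vs. resize cost either.
For the mechanism itself, I'm picturing the interleaving as something
like this toy logic (invented names, not actual PostgreSQL code - it
just alternates the two actions instead of always doubling the number
of batches):

#include <stdbool.h>

typedef struct HashSizingState
{
    int         nbuckets;       /* size of the bucket array */
    int         nbatches;       /* number of batches */
    bool        shrink_next;    /* which action to take on next overflow */
} HashSizingState;

/*
 * Called whenever adding a tuple would push the hash table over the
 * work_mem limit: alternately halve the bucket array (cheap, but it
 * raises the effective NTUP_PER_BUCKET) and double the number of
 * batches (frees tuple memory, at the cost of extra batching/rescans).
 */
static void
handle_overflow(HashSizingState *hs)
{
    if (hs->shrink_next && hs->nbuckets > 1)
        hs->nbuckets /= 2;
    else
        hs->nbatches *= 2;

    hs->shrink_next = !hs->shrink_next;
}

Whether that actually beats always doubling nbatches comes back to
exactly the costs you mention, so I have nothing better to offer there.

--
Simon Riggs                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services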