On Mon, May 20, 2019 at 12:22 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On Mon, May 20, 2019 at 11:07:03AM +1200, Thomas Munro wrote:
> >First let me restate the PostgreSQL terminology for this stuff so I
> >don't get confused while talking about it:
> >
> >* The inner side of the join = the right side = the side we use to
> >build a hash table. Right and full joins emit inner tuples when there
> >is no matching tuple on the outer side.
> >
> >* The outer side of the join = the left side = the side we use to
> >probe the hash table. Left and full joins emit outer tuples when
> >there is no matching tuple on the inner side.
> >
> >* Semi and anti joins emit exactly one instance of each outer tuple if
> >there is/isn't at least one match on the inner side.
> >
>
> I think you're conflating inner/outer side and left/right, or rather
> assuming it's always left=inner and right=outer.
In PostgreSQL, it's always inner = right, outer = left. You can see
that reflected in plannodes.h and elsewhere:

/* ----------------
 *      these are defined to avoid confusion problems with "left"
 *      and "right" and "inner" and "outer". The convention is that
 *      the "left" plan is the "outer" plan and the "right" plan is
 *      the inner plan, but these make the code more readable.
 * ----------------
 */
#define innerPlan(node)         (((Plan *)(node))->righttree)
#define outerPlan(node)         (((Plan *)(node))->lefttree)

I'm not sure why you think it's not always like that: are you referring
to the fact that the planner can choose to reverse the join (compared
to the SQL LEFT|RIGHT JOIN that appeared in the query), creating an
extra layer of confusion? In my email I was talking only about left
and right as seen by the executor.

> >About the question of when exactly to set the "use_NLJ" flag: I had
> >originally been thinking of this only as a way to deal with the
> >extreme skew problem. But in light of Tomas's complaints about
> >unmetered per-batch memory overheads, I had a new thought: it should
> >also be triggered whenever doubling the number of batches would halve
> >the amount of memory left for the hash table (after including the size
> >of all those BufFile objects in the computation as Tomas proposes). I
> >think that might be exactly the right cut-off if you want to do
> >as much Grace partitioning as your work_mem can afford, and therefore
> >as little looping as possible to complete the join while respecting
> >work_mem.
> >
>
> Not sure what NLJ flag rule you propose, exactly.
>
> Regarding the threshold value - once the space for BufFiles (and other
> overhead) gets over work_mem/2, it does not make any sense to increase
> the number of batches because then the work_mem would be entirely
> occupied by BufFiles.
>
> The WIP patches don't actually do exactly that though - they just check
> if the incremented size would be over work_mem/2. I think we should
> instead allow up to work_mem*2/3, i.e. stop adding batches after the
> BufFiles start consuming more than work_mem/3 memory.
>
> I think that's actually what you mean by "halving the amount of memory
> left for the hash table" because that's what happens after reaching the
> work_mem/3.

Well, instead of an arbitrary number like work_mem/2 or work_mem * 2/3,
I was trying to figure out the precise threshold beyond which it
doesn't make sense to expend more memory on BufFile objects, even if
the keys are uniformly distributed so that splitting batches halves the
expected tuple count per batch.

Let work_mem_for_hash_table = work_mem - nbatch * sizeof(BufFile).
Whenever you increase nbatch, work_mem_for_hash_table goes down, but it
had better be more than half what it was before, or we expect to run
out of memory again (if the batch didn't fit before, and we're now
splitting it so that we'll try to load only half of it, we'd better
have more than half the budget for the hash table that we had before).
Otherwise you'd be making matters worse, and this process probably
won't terminate.

> But I think that rule is irrelevant here, really, because this thread
> was discussing cases where adding batches is futile due to skew, no? In
> which case we should stop adding batches after reaching some % of tuples
> not moving from the batch.

Yeah, this thread started off just about the 95% thing, but veered off
course since these topics are tangled up. Sorry.
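Going back to the halving rule for a second, here is a rough sketch in
C of the check I have in mind, assuming for simplicity that
sizeof(BufFile) is the only per-batch overhead we track (the function
name and shape are made up for illustration, not proposed executor
code):

#include <stdbool.h>
#include <stddef.h>

static bool
worth_doubling_nbatch(size_t work_mem, size_t nbatch, size_t buffile_size)
{
    size_t      overhead_now = nbatch * buffile_size;
    size_t      overhead_after = 2 * nbatch * buffile_size;

    /* If the doubled BufFile array alone exceeds work_mem, give up now. */
    if (overhead_after >= work_mem)
        return false;

    /*
     * Doubling only pays off if the memory left for the hash table
     * afterwards is more than half of what it is now; otherwise the
     * half-sized batches still won't fit and we'd just keep splitting.
     * Rearranged, the condition is 3 * nbatch * buffile_size < work_mem,
     * i.e. stop once the doubled BufFile array would use more than 2/3
     * of work_mem -- which, if I have the algebra right, is the same
     * work_mem/3 threshold you describe above.
     */
    return (work_mem - overhead_after) > (work_mem - overhead_now) / 2;
}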
> Or are you suggesting we should remove that rule, and instead rely on
> this rule about halving the hash table space? That might work too, I
> guess.

No, I suspect you need both rules. We still want to detect extreme
skew as soon as possible, even though the other rule will eventually
fire; might as well do it sooner in clear-cut cases.

> OTOH I'm not sure it's a good idea to handle both those cases the same
> way - "overflow file" idea works pretty well for cases where the hash
> table actually can be split into batches, and I'm afraid NLJ will be
> much less efficient for those cases.

Yeah, you might be right about that, and everything I'm describing is
pure vapourware anyway. But your overflow file scheme isn't exactly
free of IO-amplification and multiple-processing of input data
either... and I haven't yet grokked how it would work for parallel
hash. Parallel hash generally doesn't have the
'throw-the-tuples-forward' concept, which is inherently based on
sequential in-order processing of batches.

--
Thomas Munro
https://enterprisedb.com