On 2/17/22 21:15, Robert Haas wrote:
> On Tue, Feb 1, 2022 at 10:08 AM Andy Fan <zhihui.fan1...@gmail.com> wrote:
>> To address the row estimation issue, the most straightforward way to
>> fix this is to ignore the derived clauses when figuring out the
>> RelOptInfo->rows on the base relation. To note which clause is derived
>> from this patch, I added a new field "EquivalenceClass *derived" in
>> RestrictInfo, and then added an included_derived option in
>> clauselist_selectivity_ext. During the set_xxx_rel_size functions we
>> can pass included_derived=false; this strategy should be used in
>> get_parameterized_baserel_size. In all the other cases,
>> include_derived=true is used. These are finished in commit 2. (Commit 1
>> is David's patch, I just rebased it.)
>
> That doesn't sound correct to me.
>
> Suppose that we have A.x = B.x and also A.x < 42. We can choose to
> enforce A.x < 42 or we can choose to enforce B.x < 42 or we can do
> both. In general, any of those could be right: if either one of those
> two is highly selective while the other is not very selective at all,
> it's going to be fastest to enforce only the more selective qual. But
> if both are selective then it may be best to enforce both, so let's
> suppose we do that. If we don't adopt the proposal above and just do
> nothing, then our row count estimates for both A and B will include
> the effect of checking x < 42, and so they will be correct, but the
> row count estimate for join(A, B) will include the effect of checking
> x < 42 twice, and so it will be too low, which can mess up the plan at
> higher levels.
>
> But discounting the effect of B.x < 42 when estimating the size of B
> is also incorrect. Now, the row count estimate for join(A, B) will
> include the effect of x < 42 only once, which is good. However, the
> row count estimate for B will be too high, because it will not include
> the effect of B.x < 42. And that means that the cost estimate for
> join(A, B) will be wrong. It will be too high, because it's going to
> think that it has more rows coming from the B side of the join than
> what is actually the case. And that can also mess up the plan at
> higher levels.
>
> I think we could get around this problem by having multiple
> RelOptInfos (or something similar that is lighter-weight) for each
> relation. Today, we'd create a RelOptInfo for A, one for B, and one
> for join(A, B), and the paths for the join are created by joining a
> path for A to a path for B. Now imagine that we have instead 5
> RelOptInfos, for {A}, {A|x<42}, {B}, {B|x<42}, and join(A, B). The
> legal paths for that last one can be created by joining {A} to
> {B|x<42} or {A|x<42} to {B} or {A|x<42} to {B|x<42}. Each of those 5
> RelOptInfos can have its own cardinality estimate, and it seems pretty
> straightforward to see how to get both the scan cardinality and the
> join cardinality correct. Now I think this is decidedly non-trivial to
> implement, and I also hear the voice of Tom Lane saying that's going
> to be expensive in both time and memory, and he's not wrong.
>
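To make that scenario concrete, the query being discussed is essentially
something like this (just a sketch, with made-up tables):

    CREATE TABLE a (x int);
    CREATE TABLE b (x int);

    SELECT * FROM a JOIN b ON (a.x = b.x) WHERE a.x < 42;

where the patches discussed in this thread would derive "b.x < 42" from
the equivalence class {a.x, b.x}, and the planner could then enforce the
inequality on a, on b, or on both sides.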
IMHO the whole problem is we're unable to estimate the join clause as a
conditional probability, i.e.

    P(A.x = B.x | (A.x < 42) & (B.x < 42))

so maybe instead of trying to generate additional RelOptInfo items we
should think about improving that. The extra RelOptInfos don't really
solve this, because even if you decide to join A|x<42 to B|x<42 it does
nothing to improve the join clause estimate. With equality clauses we
don't have this issue, because if you derive clauses at the baserel
level, the join clause becomes a no-op with selectivity 1.0. But for
inequalities that does not work ...

Interestingly enough, the patch [1] tries to do something like this by
applying extended statistics to joins, and using baserestrictinfos as
"conditions" for statistics on both sides. It actually deals with a more
general form of this case, because the clauses don't need to reference
the same attribute - so for example this would work too, assuming there
is an extended stats object on the columns on each side:

    P(A.c = B.d | (A.e < 42) & (B.f < 42))

[1] https://commitfest.postgresql.org/36/3055/
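For that more general case it would mean something like this (again just
a sketch - different hypothetical tables than above, and it assumes the
patch from [1]):

    CREATE TABLE a (c int, e int);
    CREATE TABLE b (d int, f int);

    CREATE STATISTICS a_stats ON c, e FROM a;
    CREATE STATISTICS b_stats ON d, f FROM b;

    SELECT * FROM a JOIN b ON (a.c = b.d) WHERE a.e < 42 AND b.f < 42;

with the baserestrictinfos (a.e < 42) and (b.f < 42) used as conditions
when estimating the selectivity of the join clause (a.c = b.d).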
> On the other hand, I completely agree with David's comments on the
> other thread to the effect that holding our breath is not getting us
> anywhere. People don't keep asking for this feature because it's a
> stupid thing that nobody really wants, and when Tom alleges that it
> will rarely pay off, I think he's pretty far off the mark. The only
> time we need to consider doing any extra work is when we have
> something like the example discussed here, namely A.x = B.x and
> A.x < 42. If there is a variable that is part of an equivalence class
> and also is used in a scan qual, what are the chances that the implied
> inequality is useful? There's no way to estimate that mathematically -
> it's all about what you think human beings are typically going to do -
> but I'd say it's probably better than 50%. I know that when I was
> regularly doing application programming on top of PostgreSQL I was
> VERY aware of this limitation of the optimizer and habitually thought
> about which table to write the inequality against. That kept me out of
> trouble most of the time, but it sure seems like we're punting the
> optimizer's job to the end user.
>

Not sure. In my experience queries with both a join clause and other
clauses referencing the same attribute are pretty rare. But I agree that
if we can do the expensive stuff only when actually needed, with no cost
in the 99.999% of other cases, I don't see why not. Of course, code
complexity is a cost too.

> And even then, I still sometimes couldn't stay out of trouble, because
> sometimes I knew that the implied inequality really ought to be
> enforced against both sides of the join to get a decent plan. In that
> case, the only way to get the optimizer to do what I wanted was to
> duplicate the qual. But that runs headlong into the exact problem that
> we're talking about here: now the join selectivity is going to be
> messed up, and then some other part of the plan would get messed up. I
> still remember the frustration associated with that scenario more than
> 10 years later. You can't even fix it by uglifying your query with a
> planner hint, because we don't support those either. Which brings me
> to another point: it's incoherent to simultaneously argue that we
> shouldn't have planner hints but rather focus on improving the
> planner, and at the same time refuse to improve the planner because it
> would make planning too expensive. I actually think we should do both,
> because I neither believe that it's impossible to fix this particular
> problem nor that it is possible to create a planner so good that it
> always makes the right decisions without any explicit input from a
> human being. But the only way you can think this problem is unfixable
> and at the same time think we don't need hints is if you think this
> problem is fake.
>

IMHO to deal with the estimates it'd be enough to allow calculating
conditional probabilities. No comment regarding hints ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company