On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Apr 3, 2017 at 8:52 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>>> One last observation:
>>>
>>> +       /*
>>> +        * An IS NOT NULL test is a no-op if there's any other strict quals,
>>> +        * so if that's the case, then we'll only apply this, otherwise we'll
>>> +        * ignore it.
>>> +        */
>>> +       else if (cslist->selmask == CACHESEL_NOTNULLTEST)
>>> +           s1 *= cslist->notnullsel;
>>>
>>> In some paths, the range selectivity estimation code punts and returns
>>> DEFAULT_RANGE_INEQ_SEL. In those cases, if a not null test was
>>> present, it should still be applied. It could make a big difference if
>>> the selectivity for the nulltest is high.
>>
>> I'd say that the location that applies the DEFAULT_RANGE_INEQ_SEL
>> should apply the nullfrac. Seems wrong to rely on a IS NOT NULL test
>> to exists to estimate that properly. I don't think that needs done as
>> part of this patch. I'd rather limit the scope since "returned with
>> feedback" is already knocking at the door of this patch.
>
> I agree, but this patch regresses for those cases if the user
> compensated with a not null test, leaving little recourse, so I'd fix
> it in this patch to avoid the regression.
>
> Maybe you're right that the null fraction should be applied regardless
> of whether there's an explicit null check, though.

The more I read that code, the more I'm convinced there's no way to end up
with DEFAULT_RANGE_INEQ_SEL when (rqlist->hibound != DEFAULT_INEQ_SEL &&
rqlist->lobound != DEFAULT_INEQ_SEL), other than from contradictory clauses.
That's a case the current code handles very poorly: it returns
DEFAULT_RANGE_INEQ_SEL and ignores nullfrac, which can give way-off
estimates if the table has lots of nulls.

Say, consider a case where "value" is mostly null and you write:

SELECT * FROM testdata
WHERE value BETWEEN 10 AND 20
  AND value BETWEEN 50 AND 60;

All other cases I can think of would end with either hibound or lobound
being equal to DEFAULT_INEQ_SEL.

Judging from a git blame, it seems the checks in the else branch were left
only as a precaution, and an unnecessary one at that.

If you ask me, I'd just leave:

+       if (rqlist->hibound == DEFAULT_INEQ_SEL ||
+           rqlist->lobound == DEFAULT_INEQ_SEL)
+       {
+           /*
+            * No point in checking null selectivity, DEFAULT_INEQ_SEL
+            * implies we have no stats.
+            */
+           s2 = DEFAULT_RANGE_INEQ_SEL;
+       }
+       else
+       {
...
+           s2 = rqlist->hibound + rqlist->lobound - 1.0;
+
+           nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
+           notnullsel = 1.0 - nullsel;
+
+           /* Adjust for double-exclusion of NULLs */
+           s2 += nullsel;
+           if (s2 <= 0.0)
+           {
+               if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
+               {
+                   /* Most likely contradictory clauses found */
+                   s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
+               }
+               else
+               {
+                   /* Could be a rounding error */
+                   s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
+               }
+           }
+       }

Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
cautious) estimate of the amount of rounding error that could accumulate
with 32-bit floats.

The above assumes that a non-DEFAULT_INEQ_SEL value in lobound/hibound is
an estimate based on stats, perhaps approximate, but one that makes sense
(i.e., preserves the monotonicity of the CDF), and that, as such, negative
values are a sign of either a contradiction or a rounding error.

The previous code allowed for the possibility of negatives arising from
default selectivities creeping into non-DEFAULT_INEQ_SEL estimates, but
that doesn't seem possible for anything coming out of scalarineqsel.

But I'd like it if Tom could comment on that, since he's the one who wrote
that code in commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a (way back in
2004).

Barring that, I'd just change the

    s2 = DEFAULT_RANGE_INEQ_SEL;

to

    s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;

which makes a lot more sense.
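
To make the double-exclusion arithmetic concrete, here's a minimal
standalone sketch (plain C, not PostgreSQL code; the 50% nullfrac and the
range fractions are made-up numbers) that assumes, as the snippet above
does, that each one-sided estimate already excludes NULLs:

#include <stdio.h>

/* Same value PostgreSQL uses for DEFAULT_RANGE_INEQ_SEL in selfuncs.h */
#define DEFAULT_RANGE_INEQ_SEL 0.005

int
main(void)
{
    /* Hypothetical column: 50% NULL, non-null values uniform in [0, 100) */
    double      nullsel = 0.5;              /* nullfrac */
    double      notnullsel = 1.0 - nullsel;

    /* One-sided estimates over ALL rows, so each already excludes NULLs */
    double      lobound = notnullsel * 0.9; /* sel(value > 10) */
    double      hibound = notnullsel * 0.2; /* sel(value < 20) */

    double      s2 = hibound + lobound - 1.0;

    printf("raw combination:      %g\n", s2);   /* -0.45 */

    /* Adjust for double-exclusion of NULLs */
    s2 += nullsel;
    printf("after adding nullsel: %g\n", s2);   /* 0.05, the true fraction */

    /* The nullfrac-aware default suggested at the end of the mail */
    printf("scaled default:       %g\n",
           DEFAULT_RANGE_INEQ_SEL * notnullsel); /* 0.0025 */

    return 0;
}

The naive hibound + lobound - 1.0 only goes negative because NULLs were
subtracted twice; adding nullsel back recovers the expected 0.05, and the
last line shows what the nullfrac-scaled default would return when stats
are missing.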