On Wed, 11 Aug 2021 at 00:05, Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > So with the statistics, the estimate gets a bit worse. The reason is > fairly simple - if you look at the two parts of the OR clause, we get this: > > clause actual no stats with stats > --------------------------------------------------------------- > (A < B and A <> A) 0 331667 1 > not (A < A) 1000000 333333 333333 > > This clearly shows that the first clause is clearly improved, while the > (A < A) is estimated the same way, because the clause has a single Var > so it's considered to be "simple" so we ignore the MCV selectivity and > just use the simple_sel calculated by clause_selectivity_ext. > > And the 333333 and 331667 just happen to be closer to the actual row > count. But that's mostly by luck, clearly. > > But now that I think about it, maybe the problem really is in how > statext_mcv_clauselist_selectivity treats this clause - the definition > of "simple" clauses as "has one attnum" was appropriate when only > clauses (Var op Const) were supported. But with (Var op Var) that's > probably not correct anymore. >
Hmm, interesting. Clearly the fact that the combined estimate without extended stats was better was just luck, based on it's large overestimate of the first clause. But it's also true that a (Var op Var) clause should not be treated as simple, because "simple" in this context is meant to be for clauses that are likely to be better estimated with regular stats, whereas in this case, extended stats would almost certainly do better on the second clause. Perhaps the easiest way to identify simple clauses would be in statext_is_compatible_clause(), rather than the way it's done now, because it has the relevant information at hand, so it could be made to return an extra flag. This feels like rather an artificial example though. Is there any real use for this sort of clause? Regards, Dean