Well, one option would be to modify all selectivity functions to do
something like the patch does for nulltestsel(). That seems a bit
cumbersome, because why should all those places have to care about
possibly running on the outer side of a join? For code in extensions
this would be particularly problematic, I think.
Agreed. I would say we can try it if nothing else works out.
So what I was thinking about is doing this in a way that'd make it
automatic, without having to modify the selectivity functions.
Option (3) is very simple - examine_variable would simply adjust the
statistics by tweaking the null_frac field, when looking at variables on
the outer side of the join. But it has issues when estimating multiple
conditions.
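Before getting to those issues: to make (3) a bit more concrete, here
is a rough sketch (mine, not from the patch) of the kind of adjustment
examine_variable could apply. adjust_outer_side_stats and
unmatched_frac are made-up names, and the freefunc bookkeeping for the
copied tuple is omitted:

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_statistic.h"
#include "utils/selfuncs.h"

/*
 * Sketch of option (3): inflate the null fraction for a variable on
 * the nullable side of an outer join. "unmatched_frac" - the fraction
 * of rows without a join match - is assumed to be known from somewhere.
 */
static void
adjust_outer_side_stats(VariableStatData *vardata, double unmatched_frac)
{
    HeapTuple   tuple;
    Form_pg_statistic stats;

    if (!HeapTupleIsValid(vardata->statsTuple))
        return;

    /* modify a copy, we must not scribble on the cached tuple */
    tuple = heap_copytuple(vardata->statsTuple);
    stats = (Form_pg_statistic) GETSTRUCT(tuple);

    /*
     * A NULL now comes either from a NULL in a matching row, or from
     * the row having no match at all. (stanullfrac is the C-level name
     * of the null_frac field.)
     */
    stats->stanullfrac = stats->stanullfrac * (1.0 - unmatched_frac)
        + unmatched_frac;

    vardata->statsTuple = tuple;
}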
Imagine t1 has 1M rows, and we want to estimate
SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
WHERE ((t2.a=1) AND (t2.b=1))
but only 50% of the t1 rows have a match in t2. Assume each of the t2
conditions matches 100% of the rows in the table. With the correction,
this means 50% selectivity for each condition. And if we combine them
the usual way, we get 0.5 * 0.5 = 0.25.
But we know all the rows in the "matching" part match the condition, so
the correct selectivity should be 0.5.
In a way, this is just another case of estimation issues due to the
assumption of independence.
FWIW, I used "AND" in the example for simplicity, but that'd probably be
pushed to the baserel level. There'd need to be an OR to keep it at the
join level, but the overall issue is the same, I think.
Also, this entirely ignores extended statistics - I have no idea how we
might tweak those in (3).
I understood the idea - it is very similar to what is implemented in the
current patch.
But I don't understand how to do it in the examine_variable function, to
be honest.
But (4) was suggesting we could improve this essentially by treating the
join as two distinct sets of rows
- the inner join result
- rows without a match on the outer side
For the inner part, we would do estimates as now (using the regular
per-column statistics). If we knew the conditions match 100% of the
rows, we'd still get 100% when the conditions are combined.
For the second part of the join we know the outer side is just NULLs in
all columns, and that'd make the estimation much simpler for most
clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
that's it.
And then we'd just combine these two selectivities. If we know the
inner part is 50% of the rows, all of which match the conditions, while
no rows in the other 50% match, the overall selectivity is 50%:
inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.5 * 0.0 = 0.5
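Expressed as code, the combination is trivial - this helper is purely
illustrative, with made-up names, not anything from the patch:

#include "nodes/nodes.h"        /* for Selectivity */

/*
 * Combine the two parts of the outer join result: the fraction with a
 * match (inner_frac), estimated as inner_sel, and the unmatched
 * remainder, estimated as outer_sel.
 */
static Selectivity
combine_outer_join_sel(Selectivity inner_sel, Selectivity outer_sel,
                       double inner_frac)
{
    return inner_frac * inner_sel + (1.0 - inner_frac) * outer_sel;
}

For the example above, combine_outer_join_sel(1.0, 0.0, 0.5) gives 0.5.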
Now, we still have issues with the independence assumption in each of these
parts separately. But that's OK, I think.
I think (4) could be implemented by doing the current estimation for
the inner part, and by tweaking examine_variable in the "outer" part in
a way similar to (3), except that it just sets null_frac=1.0 everywhere.
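The "outer" part could then look roughly like this - again a
hypothetical sketch, with the same tuple-freeing details omitted as in
the earlier snippet:

/*
 * Sketch of the outer part of option (4): every column coming from the
 * nullable side is NULL, so the fake statistics are trivial.
 */
static void
set_all_null_stats(VariableStatData *vardata)
{
    HeapTuple   tuple;
    Form_pg_statistic stats;

    if (!HeapTupleIsValid(vardata->statsTuple))
        return;

    /* modify a copy, not the cached tuple */
    tuple = heap_copytuple(vardata->statsTuple);
    stats = (Form_pg_statistic) GETSTRUCT(tuple);

    stats->stanullfrac = 1.0;   /* everything is NULL here */
    stats->stadistinct = 0.0;   /* no non-NULL values remain */

    vardata->statsTuple = tuple;
}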
As for the extended statistics, for (4) we don't need to tweak those at
all, because for the inner part we can just apply them as-is, and for
the outer part it's irrelevant because everything is NULL.
I like this idea the most. I'll try to start with this and implement
the patch.
I hope this makes more sense. If not, let me know and I'll try to
explain it better.
Thank you for your explanation.
I will write back soon with the results, or if I have any questions.
--
Regards,
Alena Rybakina
Postgres Professional