Re: using extended statistics to improve join estimates

Tomas Vondra Mon, 13 Dec 2021 10:25:24 -0800

On 11/6/21 11:03, Andy Fan wrote:

Hi Tomas:


This is the exact patch I want, thanks for the patch!


Good to hear.

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<[email protected]> wrote:

3) estimation by join pairs

At the moment, the estimates are calculated for pairs of relations, so
for example given a query

   explain analyze
   select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                    join t3 on (t1.b = t3.b and t2.c = t3.c);

we'll estimate the first join (t1,t2) just fine, but then the second
join actually combines (t1,t2,t3). What the patch currently does is it
splits it into (t1,t2) and (t2,t3) and estimates those.


Actually I can't understand how this works even for a simpler example.
let's say we query like this (ONLY use t2's column to join t3).

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                          join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2)  AND JoinRel(t2, t3).  But when comes
to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel,  so it
is hard to
work.  Here I see your solution is splitting it into (t1, t2) AND (t2,
t3) and estimate
those.  But how does this help to estimate the size of JoinRel(t1, t2, t3)?

Yeah, this is really confusing. The crucial thing to keep in mind isthis works with clauses before running setrefs.c, so the clausesreference the original relations - *not* the join relation. Otherwiseeven the regular estimation would not work, because where would it getthe per-column MCV lists etc.

Let's use a simple case with join clauses referencing just a singleattribute for each pair or relations. And let's talk about how many joinpairs it'll extract:


  t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b)

=> first we join t1/t2, which is 1 join pair (t1,t2)

=> then we join t1/t2/t3, but the join clause references just 2 rels, so1 join pair (t1,t3)


Now a more complicated case, with more complex join clause

  t1 JOIN t2 ON (t1.a = t2.a) JOIN t3 ON (t1.b = t3.b AND t2.c = t3.c)

=> first we join t1/t2, which is 1 join pair (t1,t2)

=> then we join t1/t2/t3, but this time the join clause references allthree rels, so we have two join pairs (t1,t3) and (t2,t3) and we can useall the statistics.

I wonder if this
should actually combine all three MCVs at once - we're pretty much
combining the MCVs into one large MCV representing the join result.


I guess we can keep the MCVs on joinrel for these matches.  Take the above
query I provided for example, and suppose the MCV data as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b,  we can build the MCV as below

(1, 2, 1, 2)  -> 0.1 * 0.2
(1, 3, 1, 3)  -> 0.2 * 0.1

And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1)


Right, that's about the joint distribution I whole join.

With this design, the nitems of MCV on joinrel would be less than
either of baserel.

Actually, I think the number of items can grow, because the matches mayduplicate some items. For example in your example with (t1.a = t2.a) thefirst first (1,2) item in t1 matches (1,2) and (1,3) in t2. And same for(1,3) in t1. So that's 4 combinations. Of course, we could aggregate theMCV by ignoring columns not used in the query.

and since we handle the eqjoin as well, we even can record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

About when we should maintain the JoinRel's MCV data, rather than
maintain this just
after the JoinRel size is estimated, we can only estimate it when it
is needed.  for example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                          join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3)  since no others
need it at all. However
I don't check code too closely to see if it (Lazing computing MVC on
joinrel) is convenient
to do.


I'm not sure I understand what you're proposing here.

However, I think that estimating it for pairs has two advantages:

1) Combining MCVs for k relations requires k for loops. Processing 2relations at a time limits the amount of CPU we need. Of course, thisassumes the joins are independent, which may or may not be true.

2) It seems fairly easy to combine different types of statistics(regular, extended, ...), and also consider the part not represented byMCV. It seems much harder when joining more than 2 relations.

I'm also worried about amplification of errors - I suspect attempting tobuild the joint MCV for the whole join relation may produce significantestimation errors.

Furthermore, I think joins with clauses referencing more than just tworelations are fairly uncommon. And we can always improve the feature inthis direction in the future.

But I haven't done that yet, as it requires the MCVs to be combined
using the join clauses (overlap in a way), but I'm not sure how likely
that is in practice. In the example it could help, but that's a bit
artificial example.


4) still just inner equi-joins

I haven't done any work on extending this to outer joins etc. Adding
outer and semi joins should not be complicated, mostly copying and
tweaking what eqjoinsel() does.


Overall, thanks for the feature and I am expecting there are more cases
to handle during discussion.  To make the review process more efficient,
I suggest that we split the patch into smaller ones and review/commit them
separately if we have finalized the design roughly . For example:

Patch 1 -- required both sides to have extended statistics.
Patch 2 -- required one side to have extended statistics and the other side had
per-column MCV.
Patch 3 -- handle the case like  WHERE t1.a = t2.a  and t1.b = Const;
Patch 3 -- handle the case for 3+ table joins.
Patch 4 -- supports the outer join.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse.  If you agree with this,
I can do that splitting work during my review process.

I'll consider splitting it like this, but I'm not sure it makes the mainpatch that much smaller.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: using extended statistics to improve join estimates

Reply via email to