Re: using extended statistics to improve join estimates

Andy Fan Sat, 06 Nov 2021 03:03:54 -0700

Hi Tomas:

This is the exact patch I want, thanks for the patch!

On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:

> 3) estimation by join pairs
>
> At the moment, the estimates are calculated for pairs of relations, so
> for example given a query
>
>   explain analyze
>   select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
>                    join t3 on (t1.b = t3.b and t2.c = t3.c);
>
> we'll estimate the first join (t1,t2) just fine, but then the second
> join actually combines (t1,t2,t3). What the patch currently does is it
> splits it into (t1,t2) and (t2,t3) and estimates those.

Actually I can't understand how this works even for a simpler example.
let's say we query like this (ONLY use t2's column to join t3).

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                         join t3 on (t2.c = t3.c and t2.d = t3.d);

Then it works well on JoinRel(t1, t2)  AND JoinRel(t2, t3).  But when comes
to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel,  so it
is hard to
work.  Here I see your solution is splitting it into (t1, t2) AND (t2,
t3) and estimate
those.  But how does this help to estimate the size of JoinRel(t1, t2, t3)?

> I wonder if this
> should actually combine all three MCVs at once - we're pretty much
> combining the MCVs into one large MCV representing the join result.
>

I guess we can keep the MCVs on joinrel for these matches.  Take the above
query I provided for example, and suppose the MCV data as below:

t1(a, b)
(1, 2) -> 0.1
(1, 3) -> 0.2
(2, 3) -> 0.5
(2, 8) -> 0.1

t2(a, b)
(1, 2) -> 0.2
(1, 3) -> 0.1
(2, 4) -> 0.2
(2, 10) -> 0.1

After t1.a = t2.a AND t1.b = t2.b,  we can build the MCV as below

(1, 2, 1, 2)  -> 0.1 * 0.2
(1, 3, 1, 3)  -> 0.2 * 0.1

And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) *
(0.2 + 0.1 + 0.2 + 0.1)

With this design, the nitems of MCV on joinrel would be less than
either of baserel.

and since we handle the eqjoin as well, we even can record the items as

(1, 2) -> 0.1 * 0.2
(1, 3) -> 0.2 * 0.1;

About when we should maintain the JoinRel's MCV data, rather than
maintain this just
after the JoinRel size is estimated, we can only estimate it when it
is needed.  for example:

select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b)
                         join t3 on (t2.c = t3.c and t2.d = t3.d);

we don't need to maintain the MCV on (t1, t2, t3)  since no others
need it at all. However
I don't check code too closely to see if it (Lazing computing MVC on
joinrel) is convenient
to do.

> But I haven't done that yet, as it requires the MCVs to be combined
> using the join clauses (overlap in a way), but I'm not sure how likely
> that is in practice. In the example it could help, but that's a bit
> artificial example.
>
>
> 4) still just inner equi-joins
>
> I haven't done any work on extending this to outer joins etc. Adding
> outer and semi joins should not be complicated, mostly copying and
> tweaking what eqjoinsel() does.
>

Overall, thanks for the feature and I am expecting there are more cases
to handle during discussion.  To make the review process more efficient,
I suggest that we split the patch into smaller ones and review/commit them
separately if we have finalized the design roughly . For example:

Patch 1 -- required both sides to have extended statistics.
Patch 2 -- required one side to have extended statistics and the other side had
per-column MCV.
Patch 3 -- handle the case like  WHERE t1.a = t2.a  and t1.b = Const;
Patch 3 -- handle the case for 3+ table joins.
Patch 4 -- supports the outer join.

I think we can do this if we are sure that each individual patch would work in
some cases and would not make any other case worse.  If you agree with this,
I can do that splitting work during my review process.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)

Re: using extended statistics to improve join estimates

Reply via email to