Hi Tomas: This is the exact patch I want, thanks for the patch!
On Thu, Oct 7, 2021 at 3:33 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > 3) estimation by join pairs > > At the moment, the estimates are calculated for pairs of relations, so > for example given a query > > explain analyze > select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) > join t3 on (t1.b = t3.b and t2.c = t3.c); > > we'll estimate the first join (t1,t2) just fine, but then the second > join actually combines (t1,t2,t3). What the patch currently does is it > splits it into (t1,t2) and (t2,t3) and estimates those. Actually I can't understand how this works even for a simpler example. let's say we query like this (ONLY use t2's column to join t3). select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t2.c = t3.c and t2.d = t3.d); Then it works well on JoinRel(t1, t2) AND JoinRel(t2, t3). But when comes to JoinRel(t1, t2, t3), we didn't maintain the MCV on join rel, so it is hard to work. Here I see your solution is splitting it into (t1, t2) AND (t2, t3) and estimate those. But how does this help to estimate the size of JoinRel(t1, t2, t3)? > I wonder if this > should actually combine all three MCVs at once - we're pretty much > combining the MCVs into one large MCV representing the join result. > I guess we can keep the MCVs on joinrel for these matches. Take the above query I provided for example, and suppose the MCV data as below: t1(a, b) (1, 2) -> 0.1 (1, 3) -> 0.2 (2, 3) -> 0.5 (2, 8) -> 0.1 t2(a, b) (1, 2) -> 0.2 (1, 3) -> 0.1 (2, 4) -> 0.2 (2, 10) -> 0.1 After t1.a = t2.a AND t1.b = t2.b, we can build the MCV as below (1, 2, 1, 2) -> 0.1 * 0.2 (1, 3, 1, 3) -> 0.2 * 0.1 And recording the total mcv frequence as (0.1 + 0.2 + 0.5 + 0.1) * (0.2 + 0.1 + 0.2 + 0.1) With this design, the nitems of MCV on joinrel would be less than either of baserel. and since we handle the eqjoin as well, we even can record the items as (1, 2) -> 0.1 * 0.2 (1, 3) -> 0.2 * 0.1; About when we should maintain the JoinRel's MCV data, rather than maintain this just after the JoinRel size is estimated, we can only estimate it when it is needed. for example: select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b) join t3 on (t2.c = t3.c and t2.d = t3.d); we don't need to maintain the MCV on (t1, t2, t3) since no others need it at all. However I don't check code too closely to see if it (Lazing computing MVC on joinrel) is convenient to do. > But I haven't done that yet, as it requires the MCVs to be combined > using the join clauses (overlap in a way), but I'm not sure how likely > that is in practice. In the example it could help, but that's a bit > artificial example. > > > 4) still just inner equi-joins > > I haven't done any work on extending this to outer joins etc. Adding > outer and semi joins should not be complicated, mostly copying and > tweaking what eqjoinsel() does. > Overall, thanks for the feature and I am expecting there are more cases to handle during discussion. To make the review process more efficient, I suggest that we split the patch into smaller ones and review/commit them separately if we have finalized the design roughly . For example: Patch 1 -- required both sides to have extended statistics. Patch 2 -- required one side to have extended statistics and the other side had per-column MCV. Patch 3 -- handle the case like WHERE t1.a = t2.a and t1.b = Const; Patch 3 -- handle the case for 3+ table joins. Patch 4 -- supports the outer join. I think we can do this if we are sure that each individual patch would work in some cases and would not make any other case worse. If you agree with this, I can do that splitting work during my review process. -- Best Regards Andy Fan (https://www.aliyun.com/)