[ 
https://issues.apache.org/jira/browse/FLINK-14625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973011#comment-16973011
 ] 

Caizhi Weng commented on FLINK-14625:
-------------------------------------

Hi dear Flink community!

I would like to implement this improvement. My ideas are explained as follows:
h3. 1. How (unnecessary) cross joins are generated

Consider a SQL query
{code:sql}
SELECT * FROM A, B, C, D
WHERE A.f1 = C.f1
AND C.f2 = D.f2
AND D.f3 = B.f3
AND C.f4 = B.f4
{code}
If we join the tables together in the original order of A, B, C and D, a cross 
join will be generated when joining A and B because there is no join filter 
between them.

But if we change the join order to A, C, D and B, as every two neighboring 
tables are related with an equal join filter, no cross join will be generated. 
Such plan is our desired plan.

So the problem becomes: how to discover the "connection" between input tables?
h3. 2. Discover connections between inputs with join graphs

Let's consider each input table as a vertex and each equal join filter as an 
edge between vertices. In this way, the join query will be changed into a 
graph. For example, the above SQL query can be considered as
{code:java}
A    B
|  / |
| /  |
C----D
{code}
As there will be no cross joins between inputs connected by edges, there will 
be no cross joins in a connected component. What we need to do is to first join 
the inputs connected by edges together and then determine the join orders 
between connected components.
h3. 3. Determine the new join order

The new join order will be determined by the following rules:
 # Inputs connected by equal join filters have the highest priority. As stated 
above, there will be no cross joins between them. Inputs with smaller index 
will be joined first, as we want to keep the original join order as much as 
possible.
 # After the 1st step, several connected components will be generated. 
Connected components related with any other join filters will be joined next. 
Although we will need a nested-loop-join to join them, join filters will help 
us reduce the result size as much as possible.
 # After the 2nd step, if not all inputs are joined together, join them in 
arbitrary order. Nothing can help us determine a better order now (without 
further statistics).

For example, consider the following SQL
{code:sql}
SELECT * FROM A, B, C, D, E, F, G, H
WHERE A.f1 = C.f1
AND C.f2 = D.f2
AND D.f3 = B.f3
AND C.f4 = B.f4
AND E.f5 = G.f5
AND F.f6 = H.f6
AND (A.f7 < E.f7 OR B.f8 > G.f8)
{code}
The join graph is as follows
{code:java}
A    B    E----G
|  / |
| /  |
C----D    F----H
{code}

The inputs are joined with the following steps:
# Join (A&C, B&C, B&D), (E&G), (F&H). No cross join will be generated as they 
are connected by edges. Three connected components will form after this step, 
as shown in the join graph.
# Join ABCD & EG, as there is a non-equal filter between these two connected 
components.
# Join ABCDEG & FH, as all filters are consumed and not all inputs have been 
joined together.

> Eliminate cross join in multi join to reduce cost 
> --------------------------------------------------
>
>                 Key: FLINK-14625
>                 URL: https://issues.apache.org/jira/browse/FLINK-14625
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner
>    Affects Versions: 1.9.1
>            Reporter: Leonard Xu
>            Priority: Major
>
> cross join always has  higher cost than other joins, if we can eliminate 
> cross join in multi join scene as much as possible,we can obtain  better 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to