[
https://issues.apache.org/jira/browse/FLINK-14625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973011#comment-16973011
]
Caizhi Weng commented on FLINK-14625:
-------------------------------------
Hi dear Flink community!
I would like to implement this improvement. My ideas are explained as follows:
h3. 1. How (unnecessary) cross joins are generated
Consider a SQL query
{code:sql}
SELECT * FROM A, B, C, D
WHERE A.f1 = C.f1
AND C.f2 = D.f2
AND D.f3 = B.f3
AND C.f4 = B.f4
{code}
If we join the tables together in the original order of A, B, C and D, a cross
join will be generated when joining A and B because there is no join filter
between them.
But if we change the join order to A, C, D and B, as every two neighboring
tables are related with an equal join filter, no cross join will be generated.
Such plan is our desired plan.
So the problem becomes: how to discover the "connection" between input tables?
h3. 2. Discover connections between inputs with join graphs
Let's consider each input table as a vertex and each equal join filter as an
edge between vertices. In this way, the join query will be changed into a
graph. For example, the above SQL query can be considered as
{code:java}
A B
| / |
| / |
C----D
{code}
As there will be no cross joins between inputs connected by edges, there will
be no cross joins in a connected component. What we need to do is to first join
the inputs connected by edges together and then determine the join orders
between connected components.
h3. 3. Determine the new join order
The new join order will be determined by the following rules:
# Inputs connected by equal join filters have the highest priority. As stated
above, there will be no cross joins between them. Inputs with smaller index
will be joined first, as we want to keep the original join order as much as
possible.
# After the 1st step, several connected components will be generated.
Connected components related with any other join filters will be joined next.
Although we will need a nested-loop-join to join them, join filters will help
us reduce the result size as much as possible.
# After the 2nd step, if not all inputs are joined together, join them in
arbitrary order. Nothing can help us determine a better order now (without
further statistics).
For example, consider the following SQL
{code:sql}
SELECT * FROM A, B, C, D, E, F, G, H
WHERE A.f1 = C.f1
AND C.f2 = D.f2
AND D.f3 = B.f3
AND C.f4 = B.f4
AND E.f5 = G.f5
AND F.f6 = H.f6
AND (A.f7 < E.f7 OR B.f8 > G.f8)
{code}
The join graph is as follows
{code:java}
A B E----G
| / |
| / |
C----D F----H
{code}
The inputs are joined with the following steps:
# Join (A&C, B&C, B&D), (E&G), (F&H). No cross join will be generated as they
are connected by edges. Three connected components will form after this step,
as shown in the join graph.
# Join ABCD & EG, as there is a non-equal filter between these two connected
components.
# Join ABCDEG & FH, as all filters are consumed and not all inputs have been
joined together.
> Eliminate cross join in multi join to reduce cost
> --------------------------------------------------
>
> Key: FLINK-14625
> URL: https://issues.apache.org/jira/browse/FLINK-14625
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Planner
> Affects Versions: 1.9.1
> Reporter: Leonard Xu
> Priority: Major
>
> cross join always has higher cost than other joins, if we can eliminate
> cross join in multi join scene as much as possible,we can obtain better
> performance.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)