pjmore commented on issue #77:
URL: https://github.com/apache/arrow-datafusion/issues/77#issuecomment-840742157


   So I took a look at this and I have two solutions one which I believe always 
finds all possible inner joins but runs in N^2 and one that works for this case 
and should work for most others that runs in N LogN where N is the number of 
plans in the select statement. Is the N^2 runtime and the added petgraph 
dependency alright? I don't want datafusion to show up on [Accidentally 
quadratic](https://accidentallyquadratic.tumblr.com/) because of me haha. I've 
attached a brief description of each method incase there are any obvious 
improvements that can be made. 
   
   
   #### Thorough version
   
[Code](https://github.com/pjmore/arrow-datafusion/blob/tpch-q9-planner/datafusion/src/sql/planner.rs#L471)
   1. Create undirected graph of plans where each edge represents a join 
connection.
   2. Find the connected components of the graph.
   3. For each connected component find the plan with the schema that has the 
fewest number of columns.
   4. Cross join all of the previously selected plans into a new node, removing 
the selected nodes from the graph.
   5. Starting from an arbitrary node, traverse the graph performing inner 
joins on all remaining nodes.
   
   #### Less thorough version
   
[Code](https://github.com/pjmore/arrow-datafusion/blob/tpch-query-9-simple/datafusion/src/sql/planner.rs#L468)
   1. Iterate over plans counting the number of join key hits in each schema.
   2. Sort plans so that plans that have no key hits are at the beginning and 
then the vector is sorted in descending order. E.g. the array (0,0,1,2,3,4) 
would be sorted to (0,0, 4,3,2,1). The rationale being that cross joins should 
prefer to run early to limit the amount of data that needs to be duplicated and 
that there are more likely to be joins to plans with many join keys.
   3. Build the plan as was done previously.
   
   
   The less thorough version was able to find the inner join that was missed 
for query 9 but there is still the possibility to miss inner joins. For example 
if the tables have the join relations as seen below, the order of the plans 
after sorting would be (B,F,D,A,C,E,G). Which means that the plan would 
generate a cross join between B and F as they share no join key pairs, missing 
the inner join through D.
   ```
   ┌─┐                ┌─┐
   │A│                │E│
   └┬┘                └┬┘
    │                  │
   ┌┴┐      ┌─┐       ┌┴┐
   │B├──────┤D├───────┤F│
   └┬┘      └─┘       └┬┘
    │                  │
   ┌┴┐                ┌┴┐
   │C│                │G│
   └─┘                └─┘
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to