Jefffrey commented on PR #8780:
URL: 
https://github.com/apache/arrow-datafusion/pull/8780#issuecomment-1886700745

   So there's quite a few failing tests, I chose the first I saw: 
`dataframe::tests::join`:
   
   
https://github.com/apache/arrow-datafusion/blob/a154884545cfdeb1a6c20872b3882a5624cd1119/datafusion/core/src/dataframe/mod.rs#L1892-L1906
   
   Pretty simple join test.
   
   Running it fails:
   
   ```
   Error: SchemaError(FieldNotFound { field: Column { relation: Some(Bare { 
table: "c2" }), name: "c1" }, valid_fields: [Column { relation: Some(Bare { 
table: "aggregate_test_100" }), name: "c1" }, Column { relation: Some(Bare { 
table: "aggregate_test_100" }), name: "c2" }] }, Some(""))
   ```
   
   So lets change the test to explain the output and see what the plan looks 
like:
   
   ```rust
       #[tokio::test]
       async fn join() -> Result<()> {
           let left = test_table().await?.select_columns(&["c1", "c2"])?;
           let right = test_table_with_name("c2")
               .await?
               .select_columns(&["c1", "c3"])?;
           let join = left.join(right, JoinType::Inner, &["c1"], &["c1"], 
None)?;
           join.explain(true, false)?.show().await?; // HERE
           Ok(())
       }
   ```
   
   Gives this output (truncated only to the interesting parts):
   
   ```
   
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
   | plan_type                                                  | plan          
                                                                                
     |
   
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
   | initial_logical_plan                                       | Inner Join: 
aggregate_test_100.c1 = c2.c1                                                   
       |
   |                                                            |   Projection: 
aggregate_test_100.c1, aggregate_test_100.c2                                    
     |
   |                                                            |     
TableScan: aggregate_test_100                                                   
               |
   |                                                            |   Projection: 
c2.c1, c2.c3                                                                    
     |
   |                                                            |     
TableScan: c2                                                                   
               |
   ~~~
   | logical_plan after simplify_expressions                    | Inner Join: 
c2.c1 = aggregate_test_100.c1                                                   
       |
   |                                                            |   Projection: 
aggregate_test_100.c1, aggregate_test_100.c2                                    
     |
   |                                                            |     
TableScan: aggregate_test_100                                                   
               |
   |                                                            |   Projection: 
c2.c1, c2.c3                                                                    
     |
   |                                                            |     
TableScan: c2                                                                   
               |
   ~~~
   | logical_plan after eliminate_cross_join                    | Inner Join: 
aggregate_test_100.c1 = c2.c1                                                   
       |
   |                                                            |   Projection: 
aggregate_test_100.c1, aggregate_test_100.c2                                    
     |
   |                                                            |     
TableScan: aggregate_test_100                                                   
               |
   |                                                            |   Projection: 
c2.c1, c2.c3                                                                    
     |
   |                                                            |     
TableScan: c2                                                                   
               |
   ~~~
   | logical_plan after simplify_expressions                    | Inner Join: 
c2.c1 = aggregate_test_100.c1                                                   
       |
   |                                                            |   Projection: 
aggregate_test_100.c1, aggregate_test_100.c2                                    
     |
   |                                                            |     
TableScan: aggregate_test_100                                                   
               |
   |                                                            |   Projection: 
c2.c1, c2.c3                                                                    
     |
   |                                                            |     
TableScan: c2                                                                   
               |
   ~~~
   | logical_plan after optimize_projections                    | Inner Join: 
c2.c1 = aggregate_test_100.c1                                                   
       |
   |                                                            |   TableScan: 
aggregate_test_100 projection=[c1, c2]                                          
      |
   |                                                            |   TableScan: 
c2 projection=[c1, c3]                                                          
      |
   ~~~
   | logical_plan after eliminate_cross_join                    | Inner Join: 
aggregate_test_100.c1 = c2.c1                                                   
       |
   |                                                            |   TableScan: 
aggregate_test_100 projection=[c1, c2]                                          
      |
   |                                                            |   TableScan: 
c2 projection=[c1, c3]                                                          
      |
   ~~~
   | logical_plan after simplify_expressions                    | Inner Join: 
c2.c1 = aggregate_test_100.c1                                                   
       |
   |                                                            |   TableScan: 
aggregate_test_100 projection=[c1, c2]                                          
      |
   |                                                            |   TableScan: 
c2 projection=[c1, c3]                                                          
      |
   ~~~
   | logical_plan                                               | Inner Join: 
c2.c1 = aggregate_test_100.c1                                                   
       |
   |                                                            |   TableScan: 
aggregate_test_100 projection=[c1, c2]                                          
      |
   |                                                            |   TableScan: 
c2 projection=[c1, c3]                                                          
      |
   | initial_physical_plan                                      | Schema error: 
No field named c2.c1. Valid fields are aggregate_test_100.c1, 
aggregate_test_100.c2. |
   
+------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
   ```
   
   - Removed rules which didn't change output
   
   What is interesting here, is how `simplify_expressions` and 
`eliminate_cross_join` 'fight' over the order of the join, and in the end, 
`simplify_expressions` 'wins' with its order as there is no more 
`eliminate_cross_join` run after. This seems to suggest the order of the 
columns in the join expression is important.
   
   Indeed, after some more sleuthing to find the cause (enabling `backtrace` 
feature and setting `RUST_BACKTRACE=1` then running original test to see where 
the error originated), it is here:
   
   
https://github.com/apache/arrow-datafusion/blob/a154884545cfdeb1a6c20872b3882a5624cd1119/datafusion/core/src/physical_planner.rs#L1055-L1065
   
   Specifically line 1061.
   
   `keys` is grabbed directly from the `on` condition of the LogicalPlan Join.
   
   We can see here that it relies on the assumption that given a join on 
condition of `a = b`, column `a` must be from the left child and column `b` 
must be from the right child.
   
   I hope this helps in explaining the error @yyy1000  (and the process to 
debug it)
   
   As for how to fix it, I'm not sure. Either we respect the current behaviour 
of relying on the order of columns in a join on expression to be correct with 
respect to its children, or we try to fix that to not rely on that.
   
   Going with the former might suggest having to ensure that this new 
canonicalize step won't reorder expressions in a Join on clause.
   
   Going with the latter, I'm not so sure, would need some input from those 
more familiar with the Join logic in Datafusion.
   
   Just throwing in my 2 cents


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to