Jefffrey commented on PR #8780: URL: https://github.com/apache/arrow-datafusion/pull/8780#issuecomment-1886700745
So there's quite a few failing tests, I chose the first I saw: `dataframe::tests::join`: https://github.com/apache/arrow-datafusion/blob/a154884545cfdeb1a6c20872b3882a5624cd1119/datafusion/core/src/dataframe/mod.rs#L1892-L1906 Pretty simple join test. Running it fails: ``` Error: SchemaError(FieldNotFound { field: Column { relation: Some(Bare { table: "c2" }), name: "c1" }, valid_fields: [Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c1" }, Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c2" }] }, Some("")) ``` So lets change the test to explain the output and see what the plan looks like: ```rust #[tokio::test] async fn join() -> Result<()> { let left = test_table().await?.select_columns(&["c1", "c2"])?; let right = test_table_with_name("c2") .await? .select_columns(&["c1", "c3"])?; let join = left.join(right, JoinType::Inner, &["c1"], &["c1"], None)?; join.explain(true, false)?.show().await?; // HERE Ok(()) } ``` Gives this output (truncated only to the interesting parts): ``` +------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | plan_type | plan | +------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ | initial_logical_plan | Inner Join: aggregate_test_100.c1 = c2.c1 | | | Projection: aggregate_test_100.c1, aggregate_test_100.c2 | | | TableScan: aggregate_test_100 | | | Projection: c2.c1, c2.c3 | | | TableScan: c2 | ~~~ | logical_plan after simplify_expressions | Inner Join: c2.c1 = aggregate_test_100.c1 | | | Projection: aggregate_test_100.c1, aggregate_test_100.c2 | | | TableScan: aggregate_test_100 | | | Projection: c2.c1, c2.c3 | | | TableScan: c2 | ~~~ | logical_plan after eliminate_cross_join | Inner Join: aggregate_test_100.c1 = c2.c1 | | | Projection: aggregate_test_100.c1, aggregate_test_100.c2 | | | TableScan: aggregate_test_100 | | | Projection: c2.c1, c2.c3 | | | TableScan: c2 | ~~~ | logical_plan after simplify_expressions | Inner Join: c2.c1 = aggregate_test_100.c1 | | | Projection: aggregate_test_100.c1, aggregate_test_100.c2 | | | TableScan: aggregate_test_100 | | | Projection: c2.c1, c2.c3 | | | TableScan: c2 | ~~~ | logical_plan after optimize_projections | Inner Join: c2.c1 = aggregate_test_100.c1 | | | TableScan: aggregate_test_100 projection=[c1, c2] | | | TableScan: c2 projection=[c1, c3] | ~~~ | logical_plan after eliminate_cross_join | Inner Join: aggregate_test_100.c1 = c2.c1 | | | TableScan: aggregate_test_100 projection=[c1, c2] | | | TableScan: c2 projection=[c1, c3] | ~~~ | logical_plan after simplify_expressions | Inner Join: c2.c1 = aggregate_test_100.c1 | | | TableScan: aggregate_test_100 projection=[c1, c2] | | | TableScan: c2 projection=[c1, c3] | ~~~ | logical_plan | Inner Join: c2.c1 = aggregate_test_100.c1 | | | TableScan: aggregate_test_100 projection=[c1, c2] | | | TableScan: c2 projection=[c1, c3] | | initial_physical_plan | Schema error: No field named c2.c1. Valid fields are aggregate_test_100.c1, aggregate_test_100.c2. | +------------------------------------------------------------+----------------------------------------------------------------------------------------------------+ ``` - Removed rules which didn't change output What is interesting here, is how `simplify_expressions` and `eliminate_cross_join` 'fight' over the order of the join, and in the end, `simplify_expressions` 'wins' with its order as there is no more `eliminate_cross_join` run after. This seems to suggest the order of the columns in the join expression is important. Indeed, after some more sleuthing to find the cause (enabling `backtrace` feature and setting `RUST_BACKTRACE=1` then running original test to see where the error originated), it is here: https://github.com/apache/arrow-datafusion/blob/a154884545cfdeb1a6c20872b3882a5624cd1119/datafusion/core/src/physical_planner.rs#L1055-L1065 Specifically line 1061. `keys` is grabbed directly from the `on` condition of the LogicalPlan Join. We can see here that it relies on the assumption that given a join on condition of `a = b`, column `a` must be from the left child and column `b` must be from the right child. I hope this helps in explaining the error @yyy1000 (and the process to debug it) As for how to fix it, I'm not sure. Either we respect the current behaviour of relying on the order of columns in a join on expression to be correct with respect to its children, or we try to fix that to not rely on that. Going with the former might suggest having to ensure that this new canonicalize step won't reorder expressions in a Join on clause. Going with the latter, I'm not so sure, would need some input from those more familiar with the Join logic in Datafusion. Just throwing in my 2 cents -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
