irenjj commented on PR #16016: URL: https://github.com/apache/datafusion/pull/16016#issuecomment-2907724823
The difference between DataFusion and DuckDB in constructing logical plans is that DataFusion assigns the schema directly to the `LogicalPlan`, while DuckDB keeps that metadata in the `Binder`. In DuckDB, an independent `Binder` is constructed for each query-block level. During plan construction, if a correlated column is found and its column information cannot be resolved in the current level's `Binder`, DuckDB searches the outer levels' binders; when the column is found, the binder depth is recorded in `correlated_columns`. Because the `Binder` is separated from the `LogicalPlan`, even after the planner unnests correlated subqueries (subquery -> join, which changes the subqueries' depths), the corresponding columns can still be retrieved by depth during subsequent physical planning.

DataFusion, by contrast, puts the schema into the logical plan during logical-plan construction, and only supports building logical plans for correlated subqueries that reference the immediately adjacent level:

```rust
pub(super) fn parse_scalar_subquery(
    &self,
    subquery: Query,
    input_schema: &DFSchema,
    planner_context: &mut PlannerContext,
) -> Result<Expr> {
    let old_outer_query_schema =
        planner_context.set_outer_query_schema(Some(input_schema.clone().into()));
    let sub_plan = self.query_to_plan(subquery, planner_context)?;
    let outer_ref_columns = sub_plan.all_out_ref_exprs();
    planner_context.set_outer_query_schema(old_outer_query_schema);
    ...
```

As you can see, only `input_schema` (the schema of the immediately enclosing query block) is used to construct the plan. I first tried to address this by merging/layering `input_schema` each time `parse_scalar_subquery` is entered, so that `query_to_plan` could recognize outer references at any depth and complete logical-plan construction. However, when it came to physical planning, the problem appeared again.
Since the logical plan doesn't contain the schema required by `outer_ref_col`, physical planning still reports errors. So why do adjacent-level correlated subqueries run? They also construct `outer_ref_col` during planning, and the subquery likewise lacks the corresponding schema. The reason is that after the optimizer's decorrelation pass, correlated subqueries are converted to joins. When the physical plan is constructed, the join filter is built against `filter_schema = left_schema + right_schema`, and `left_schema` happens to be the schema of the immediately adjacent outer level. So the fact that adjacent-level correlation runs normally is just a coincidence: DataFusion doesn't yet have the capability to read schemas across query-block levels.