irenjj commented on PR #16016: URL: https://github.com/apache/datafusion/pull/16016#issuecomment-2907724823
The difference between DataFusion and DuckDB in constructing logical plans is that DataFusion assigns the schema directly to the `LogicalPlan`, while DuckDB keeps that metadata in the `Binder`. In DuckDB, an independent `Binder` is constructed for each query-block level. During plan construction, if a correlated column is found and its column information cannot be resolved in the current level's `Binder`, DuckDB searches the outer levels' binders; when the column is found, the binder depth is recorded in `correlated_columns`. Because the `Binder` is separated from the `LogicalPlan`, even after the planner unnests correlated subqueries (subquery -> join, which changes the subqueries' depths), the corresponding columns can still be retrieved by depth during subsequent physical planning.

DataFusion, by contrast, puts the schema into the logical plan during logical-plan construction, and only supports building logical plans for correlated subqueries that reference the immediately adjacent level:

```rust
pub(super) fn parse_scalar_subquery(
    &self,
    subquery: Query,
    input_schema: &DFSchema,
    planner_context: &mut PlannerContext,
) -> Result<Expr> {
    let old_outer_query_schema =
        planner_context.set_outer_query_schema(Some(input_schema.clone().into()));
    let sub_plan = self.query_to_plan(subquery, planner_context)?;
    let outer_ref_columns = sub_plan.all_out_ref_exprs();
    planner_context.set_outer_query_schema(old_outer_query_schema);
    ...
```

As you can see, only `input_schema` (the schema of the immediately enclosing query block) is used to construct the plan. I first tried to address this by merging/layering `input_schema` each time `parse_scalar_subquery` is entered, so that `query_to_plan` could recognize outer references at any depth and complete logical-plan construction. However, when it came to physical planning, the problem appeared again.
Since the logical plan doesn't contain the schema required by `outer_ref_col`, physical planning still reports errors. So why do adjacent-level correlated subqueries run? They also construct `outer_ref_col` during planning, and the subquery likewise lacks the corresponding schema. The reason is that after the optimizer's decorrelation pass, correlated subqueries are converted to joins. When the physical plan is constructed, the join filter is built against `filter_schema = left_schema + right_schema`, and `left_schema` happens to be the schema of the immediately adjacent outer level. So the fact that adjacent-level correlation runs normally is just a coincidence: DataFusion doesn't yet have the capability to read schemas across query-block levels.