irenjj commented on PR #16016:
URL: https://github.com/apache/datafusion/pull/16016#issuecomment-2907724823
The difference between DataFusion and DuckDB in constructing logical plans
is that DataFusion assigns the schema directly to the `LogicalPlan`, while
DuckDB keeps that metadata in the `Binder`. In DuckDB, an independent `Binder`
is constructed for each query-block layer. During plan construction, if
correlated columns are encountered and the required column information cannot
be found in the current layer's `Binder`, the lookup continues in the
enclosing layer's `Binder`; once found, the `Binder` depth is recorded in
`correlated_columns`. Because the `Binder` is separate from the `LogicalPlan`,
even if the planner unnests a correlated subquery (subquery -> join, so the
subquery's nesting depth changes), the corresponding columns can still be
retrieved by depth during subsequent physical planning.
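To make that mechanism concrete, below is a minimal, self-contained Rust sketch of the binder-stack idea; it is not DuckDB's actual code, and the names (`Binder`, `CorrelatedColumn`, `resolve`) are hypothetical stand-ins used only for illustration.
```rust
// A minimal sketch of the binder-stack idea described above.
// NOT DuckDB's actual code; all names here are hypothetical.

/// One binder per query-block layer; `columns` are the names visible locally.
struct Binder {
    columns: Vec<String>,
}

/// A correlated column plus the number of binder layers between the use site
/// and the layer that actually defines the column.
#[derive(Debug)]
struct CorrelatedColumn {
    name: String,
    depth: usize,
}

/// Resolve `name` starting from the innermost binder (the last element),
/// walking outward and recording the depth at which the column was found.
fn resolve(binders: &[Binder], name: &str) -> Option<CorrelatedColumn> {
    for (depth, binder) in binders.iter().rev().enumerate() {
        if binder.columns.iter().any(|c| c.as_str() == name) {
            return Some(CorrelatedColumn {
                name: name.to_string(),
                depth,
            });
        }
    }
    None
}

fn main() {
    // Outer query block exposes t1.a; the inner block exposes t2.b.
    let binders = vec![
        Binder { columns: vec!["t1.a".to_string()] },
        Binder { columns: vec!["t2.b".to_string()] },
    ];
    // t1.a is not visible in the innermost binder, so it resolves one layer
    // up (depth == 1). Because the depth lives in the binder stack rather
    // than in the plan, a later rewrite such as subquery -> join does not
    // invalidate it.
    println!("{:?}", resolve(&binders, "t1.a")); // depth 1
    println!("{:?}", resolve(&binders, "t2.b")); // depth 0
}
```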
DataFusion's approach, by contrast, is to put the schema information into the
logical plan as it is constructed, and it only supports building logical plans
for correlated subqueries that reference the immediately adjacent level:
```rust
pub(super) fn parse_scalar_subquery(
    &self,
    subquery: Query,
    input_schema: &DFSchema,
    planner_context: &mut PlannerContext,
) -> Result<Expr> {
    let old_outer_query_schema =
        planner_context.set_outer_query_schema(Some(input_schema.clone().into()));
    let sub_plan = self.query_to_plan(subquery, planner_context)?;
    let outer_ref_columns = sub_plan.all_out_ref_exprs();
    planner_context.set_outer_query_schema(old_outer_query_schema);
    ...
```
As you can see, it only takes `input_schema`, the schema of the immediately
enclosing query block, when constructing the plan. I first tried to address
this by merging/stacking `input_schema` on each entry into
`parse_scalar_subquery`, so that `query_to_plan` could resolve the outer
references and finish building the logical plan. However, the problem
reappeared during physical planning: since the logical plan does not carry
the schema required by `outer_ref_col`, it still reports errors.
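For reference, here is a rough sketch of what that merge/stack attempt looks like conceptually. The types and names (`OuterSchemas`, `Field`) are hypothetical stand-ins rather than DataFusion's API; the point is only to show why name resolution can succeed during logical planning while the plan itself still lacks the outer schema.
```rust
// A rough, hypothetical sketch of the workaround described above: stack the
// outer schemas on every recursion into `parse_scalar_subquery` instead of
// keeping only the nearest one. Not DataFusion's API.

#[derive(Debug)]
struct Field {
    qualifier: String,
    name: String,
}

/// Stand-in for the planner context's outer-query schema: a stack holding the
/// schema of every enclosing query block, innermost last.
#[derive(Default)]
struct OuterSchemas {
    stack: Vec<Vec<Field>>,
}

impl OuterSchemas {
    /// Called when entering a subquery: push the enclosing block's schema.
    fn push(&mut self, schema: Vec<Field>) {
        self.stack.push(schema);
    }

    /// Called when leaving the subquery: restore the previous state.
    fn pop(&mut self) {
        self.stack.pop();
    }

    /// Resolution now searches every enclosing level, not just the nearest one.
    fn resolve(&self, qualifier: &str, name: &str) -> Option<&Field> {
        self.stack
            .iter()
            .rev()
            .flatten()
            .find(|f| f.qualifier == qualifier && f.name == name)
    }
}

fn main() {
    let mut outer = OuterSchemas::default();
    outer.push(vec![Field { qualifier: "t1".into(), name: "a".into() }]);
    outer.push(vec![Field { qualifier: "t2".into(), name: "b".into() }]);

    // While planning the innermost subquery, both t2.b (one level up) and
    // t1.a (two levels up) can now be resolved, so `query_to_plan` succeeds.
    // But the logical plan itself still does not carry these schemas, which
    // is why physical planning of `outer_ref_col` fails later on.
    assert!(outer.resolve("t2", "b").is_some());
    assert!(outer.resolve("t1", "a").is_some());

    outer.pop();
    outer.pop();
}
```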
But why do adjacent-level correlated subqueries run? They also construct
`outer_ref_col` expressions during planning, and the subquery likewise has no
corresponding schema for them. The reason is that the optimizer's
decorrelation pass converts the correlated subquery into a join. When the
physical plan is built, the join filter is resolved against
`filter_schema = left_schema + right_schema`, and `left_schema` happens to be
the schema of the immediately enclosing query block. So the fact that the
adjacent-level case runs is essentially a coincidence; DataFusion does not
yet have the ability to resolve schemas across query blocks.
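To illustrate that coincidence, here is a toy sketch (hypothetical names, not DataFusion code): the join filter is resolved against the concatenation of the two input schemas, so a one-level outer reference lands in `left_schema`, while a reference that skips a level appears in neither input.
```rust
// Toy illustration of why the adjacent-level case happens to work after
// decorrelation. Hypothetical names, not DataFusion code.

fn filter_schema(left_schema: &[&str], right_schema: &[&str]) -> Vec<String> {
    // Physical planning resolves the join filter against the concatenation
    // of the two input schemas.
    left_schema
        .iter()
        .chain(right_schema.iter())
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    // Adjacent level:
    //   SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a)
    // decorrelates into roughly `t1 JOIN t2 ON t2.b = t1.a`, so t1.a is
    // covered by left_schema and the filter resolves.
    let adjacent = filter_schema(&["t1.a"], &["t2.b"]);
    assert!(adjacent.contains(&"t1.a".to_string()));

    // Two levels deep: an outer reference to t1.a made from a subquery
    // nested two blocks down is not part of either join input once only the
    // innermost level has been rewritten, so the reference cannot resolve.
    let nested = filter_schema(&["t2.b"], &["t3.c"]);
    assert!(!nested.contains(&"t1.a".to_string()));
}
```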