dqkqd commented on code in PR #18286:
URL: https://github.com/apache/datafusion/pull/18286#discussion_r2496114961
##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -633,8 +633,46 @@ impl LogicalPlan {
LogicalPlan::Dml(_) => Ok(self),
LogicalPlan::Copy(_) => Ok(self),
LogicalPlan::Values(Values { schema, values }) => {
- // todo it isn't clear why the schema is not recomputed here
- Ok(LogicalPlan::Values(Values { schema, values }))
+ // We cannot compute the correct schema if we only use values.
+ //
+ // For example, given the following plan:
+ // Projection: col_1, col_2
+ // Values: (Float32(1), Float32(10)), (Float32(100),
Float32(10))
+ //
+ // We wouldn't know about `col_1`, and `col_2` if we only
relied on `values`.
+ // To correctly recompute the new schema, we also need to
retain some information
+ // from the original schema.
+ let new_plan =
LogicalPlanBuilder::values(values.clone())?.build()?;
+
+ let qualified_fields = schema
+ .iter()
+ .zip(new_plan.schema().fields())
+ .map(|((table_ref, old_field), new_field)| {
+ // `old_field`'s data type is unknown but
`new_field`'s is known
+ if old_field.data_type().is_null()
+ && !new_field.data_type().is_null()
+ {
+ let field = old_field
+ .as_ref()
+ .clone()
+ .with_data_type(new_field.data_type().clone());
+ (table_ref.cloned(), Arc::new(field))
Review Comment:
Thank you for looking at this
> Am I correct in my understanding here?
Correct.
Allow me to further explain why we need `self` and why we don't need the
whole `self`.
We need `self`:
- Point 1: `new plan` uses different name, this is clearly an issue.
- Point 2: `new plan` uses different nullability, this allows
`INSERT INTO non_null_table VALUES (NULL)` to pass (even though it
shouldn't),
because non-nullable `self` schema is replaced with nullable `new plan`
schema.
We don't need the whole `self`:
Let use the sql in this issue `SELECT a, b FROM (VALUES ($1, $2)) AS t(a,
b)` as reference.
After replacing params:
- `self` (`t(a,b)`) doesn't know about `a` and `b` data types.
- `new plan` (`VALUES ($1, $2)`) knows about the data types, but doesn't
know about `a` and `b`.
So, I was trying to apply new data types while keeping everything in `self`
intact.
It was basically just these two lines, but since I couldn't find something
similar to
`merge_data_type`, I ended up with the whole schema migration.
```diff
LogicalPlan::Values(Values { schema, values }) => {
+ let new_plan = // compute new plan here
+ schema.merge_data_type(new_plan.schema());
Ok(LogicalPlan::Values(Values { schema, values }))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]