alamb commented on code in PR #3726:
URL: https://github.com/apache/arrow-datafusion/pull/3726#discussion_r988379685
##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -448,7 +474,21 @@ impl ExpressionVisitor for ExprIdentifierVisitor<'_> {
self.id_array[idx] = (self.series_number, desc.clone());
self.visit_stack.push(VisitRecord::ExprItem(desc.clone()));
- let data_type = self.data_type.clone();
+
+ let data_type = if let Ok(data_type) =
expr.get_type(&self.input_schema) {
+ data_type
+ } else {
+ // expression type could not be resolved in schema, fall back to
all schemas
+ let merged_schema =
+ self.all_schemas
+ .iter()
+ .fold(DFSchema::empty(), |mut lhs, rhs| {
+ lhs.merge(rhs);
+ lhs
+ });
Review Comment:
I wonder if we could use `reduce` here (I realize you just refactored this
code)
```suggestion
let merged_schema =
self.all_schemas
.iter()
.reduce(|mut lhs, rhs| {
lhs.merge(rhs);
lhs
});
```
##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -370,7 +391,12 @@ struct ExprIdentifierVisitor<'a> {
expr_set: &'a mut ExprSet,
/// series number (usize) and identifier.
id_array: &'a mut Vec<(usize, Identifier)>,
- data_type: DataType,
+ /// input schema for the node that we're optimizing, so we can determine
the correct datatype
+ /// for each subexpression
+ input_schema: DFSchemaRef,
+ /// all schemas in the logical plan, as a fall back if we cannot resolve
an expression type
+ /// from the input schema alone
+ all_schemas: Vec<DFSchemaRef>,
Review Comment:
I don't understand in what cases we wouldn't be able to resolve an expr type
from the input schema alone.
The only case I can think of is when the plan node has more than one input
(e.g. a Join or a Union) -- but thus I would expect that we always resolve the
type of the expressions using the input schema
##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -448,7 +474,21 @@ impl ExpressionVisitor for ExprIdentifierVisitor<'_> {
self.id_array[idx] = (self.series_number, desc.clone());
self.visit_stack.push(VisitRecord::ExprItem(desc.clone()));
- let data_type = self.data_type.clone();
+
+ let data_type = if let Ok(data_type) =
expr.get_type(&self.input_schema) {
+ data_type
+ } else {
+ // expression type could not be resolved in schema, fall back to
all schemas
+ let merged_schema =
+ self.all_schemas
+ .iter()
+ .fold(DFSchema::empty(), |mut lhs, rhs| {
+ lhs.merge(rhs);
+ lhs
+ });
+ expr.get_type(&merged_schema)?
+ };
Review Comment:
I think all exprs should be resolvable with the unified schemas, as
explained above -- but maybe it is a performance optimization 🤔.
Perhaps you could leave a comment explaining that we are not sure if it is
necessary
##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -597,22 +640,36 @@ mod test {
fn id_array_visitor() -> Result<()> {
let expr = binary_expr(
binary_expr(
- sum(binary_expr(col("a"), Operator::Plus, lit("1"))),
+ sum(binary_expr(col("a"), Operator::Plus, lit(1))),
Review Comment:
I agree this test doesn't make sense as coercion should have happened before
this pass
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]