[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3726: Fix output schema generated by CommonSubExprEliminate

GitBox Wed, 05 Oct 2022 14:46:49 -0700


alamb commented on code in PR #3726:
URL: https://github.com/apache/arrow-datafusion/pull/3726#discussion_r988379685



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -448,7 +474,21 @@ impl ExpressionVisitor for ExprIdentifierVisitor<'_> {
 
         self.id_array[idx] = (self.series_number, desc.clone());
         self.visit_stack.push(VisitRecord::ExprItem(desc.clone()));
-        let data_type = self.data_type.clone();
+
+        let data_type = if let Ok(data_type) = 
expr.get_type(&self.input_schema) {
+            data_type
+        } else {
+            // expression type could not be resolved in schema, fall back to 
all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });

Review Comment:
   I wonder if we could use `reduce` here (I realize you just refactored this 
code)
   ```suggestion
               let merged_schema =
                   self.all_schemas
                       .iter()
                       .reduce(|mut lhs, rhs| {
                           lhs.merge(rhs);
                           lhs
                       });
   ```



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -370,7 +391,12 @@ struct ExprIdentifierVisitor<'a> {
     expr_set: &'a mut ExprSet,
     /// series number (usize) and identifier.
     id_array: &'a mut Vec<(usize, Identifier)>,
-    data_type: DataType,
+    /// input schema for the node that we're optimizing, so we can determine 
the correct datatype
+    /// for each subexpression
+    input_schema: DFSchemaRef,
+    /// all schemas in the logical plan, as a fall back if we cannot resolve 
an expression type
+    /// from the input schema alone
+    all_schemas: Vec<DFSchemaRef>,

Review Comment:
   I don't understand in what cases we wouldn't be able to resolve an expr type 
from the input schema alone.
   
   The only case I can think of is when the plan node has more than one input 
(e.g. a Join or a Union) -- but thus I would expect that we always resolve the 
type of the expressions using the input schema



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -448,7 +474,21 @@ impl ExpressionVisitor for ExprIdentifierVisitor<'_> {
 
         self.id_array[idx] = (self.series_number, desc.clone());
         self.visit_stack.push(VisitRecord::ExprItem(desc.clone()));
-        let data_type = self.data_type.clone();
+
+        let data_type = if let Ok(data_type) = 
expr.get_type(&self.input_schema) {
+            data_type
+        } else {
+            // expression type could not be resolved in schema, fall back to 
all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });
+            expr.get_type(&merged_schema)?
+        };

Review Comment:
   I think all exprs should be resolvable with the unified schemas, as 
explained above -- but maybe it is a performance optimization 🤔.
   
   Perhaps you could leave a comment explaining that we are not sure if it is 
necessary 



##########
datafusion/optimizer/src/common_subexpr_eliminate.rs:
##########
@@ -597,22 +640,36 @@ mod test {
     fn id_array_visitor() -> Result<()> {
         let expr = binary_expr(
             binary_expr(
-                sum(binary_expr(col("a"), Operator::Plus, lit("1"))),
+                sum(binary_expr(col("a"), Operator::Plus, lit(1))),

Review Comment:
   I agree this test doesn't make sense as coercion should have happened before 
this pass



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3726: Fix output schema generated by CommonSubExprEliminate

Reply via email to