Re: [PR] Fix the schema mismatch between logical and physical for aggregate function [datafusion]

via GitHub Fri, 16 Aug 2024 08:37:25 -0700


alamb commented on code in PR #11989:
URL: https://github.com/apache/datafusion/pull/11989#discussion_r1719999132



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -328,10 +328,45 @@ impl ExprSchemable for Expr {
                     Ok(true)
                 }
             }
+            Expr::WindowFunction(WindowFunction { fun, .. }) => {
+                match fun {
+                    WindowFunctionDefinition::BuiltInWindowFunction(func) => {
+                        if func.name() == "ROW_NUMBER"
+                            || func.name() == "RANK"
+                            || func.name() == "NTILE"
+                            || func.name() == "CUME_DIST"
+                        {
+                            Ok(false)
+                        } else {
+                            Ok(true)
+                        }
+                    }
+                    WindowFunctionDefinition::AggregateUDF(func) => {
+                        // TODO: UDF should be able to customize nullability
+                        if func.name() == "count" {
+                            // TODO: there is issue unsolved for count with 
window, should return false
+                            Ok(true)
+                        } else {
+                            Ok(true)
+                        }
+                    }
+                    _ => Ok(true),
+                }
+            }
+            Expr::ScalarFunction(ScalarFunction { func, args }) => {
+                // If all the element in coalesce is non-null, the result is 
non-null

Review Comment:
   We should probably add an API to ScalarUDFImpl to signal its 
null/non-nullness (as a follow on PR) instead of hard coding this function name
   
   ```
        func.is_nullable(args)
   ```



##########
datafusion/functions-aggregate-common/src/aggregate.rs:
##########
@@ -171,6 +171,9 @@ pub trait AggregateExpr: Send + Sync + Debug + 
PartialEq<dyn Any> {
     fn get_minmax_desc(&self) -> Option<(Field, bool)> {
         None
     }
+
+    /// Get function's name, for example `count(x)` returns `count`
+    fn func_name(&self) -> &str;

Review Comment:
   is there a reason this isn't `name()` ? `func_name` is fine, it just seems 
inconsistent with the rest of the code



##########
datafusion/expr/src/udaf.rs:
##########
@@ -196,6 +196,10 @@ impl AggregateUDF {
         self.inner.state_fields(args)
     }
 
+    pub fn fields(&self, args: StateFieldsArgs) -> Result<Field> {

Review Comment:
   Could we document this function and what it is for (also in 
AggregateUdfImpl)? 
   
   Also, the name is strange to me -- it is `fields` but it returns a single 
`Field` and the corresponding function on `AggregateUDFImpl` is called `field` 
(no `s`) 🤔 



##########
datafusion/optimizer/src/analyzer/type_coercion.rs:
##########
@@ -833,39 +841,47 @@ fn coerce_union_schema(inputs: &[Arc<LogicalPlan>]) -> 
Result<DFSchema> {
                 plan_schema.fields().len()
             );
         }
-        // coerce data type and nullablity for each field
-        for (union_datatype, union_nullable, plan_field) in izip!(
-            union_datatypes.iter_mut(),
-            union_nullabilities.iter_mut(),
-            plan_schema.fields()
-        ) {
-            let coerced_type =
-                comparison_coercion(union_datatype, 
plan_field.data_type()).ok_or_else(
-                    || {
-                        plan_datafusion_err!(
-                            "Incompatible inputs for Union: Previous inputs 
were \
-                            of type {}, but got incompatible type {} on column 
'{}'",
-                            union_datatype,
-                            plan_field.data_type(),
-                            plan_field.name()
-                        )
-                    },
-                )?;
-            *union_datatype = coerced_type;
-            *union_nullable = *union_nullable || plan_field.is_nullable();
+
+        // Safety: Length is checked
+        unsafe {

Review Comment:
   I think this unsafe block is  unecessary -- this isn't a performance 
critical piece of code. I think `izip` or just manuallly `zip`ping three times 
would be better



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -328,10 +328,45 @@ impl ExprSchemable for Expr {
                     Ok(true)
                 }
             }
+            Expr::WindowFunction(WindowFunction { fun, .. }) => {

Review Comment:
   Is this change required for this PR or is it a  "drive by" improvement?



##########
datafusion/core/src/physical_planner.rs:
##########
@@ -670,6 +670,10 @@ impl DefaultPhysicalPlanner {
                 let input_exec = children.one()?;
                 let physical_input_schema = input_exec.schema();
                 let logical_input_schema = input.as_ref().schema();
+                let physical_input_schema_from_logical: Arc<Schema> =
+                    logical_input_schema.as_ref().clone().into();
+
+                debug_assert_eq!(physical_input_schema_from_logical, 
physical_input_schema, "Physical input schema should be the same as the one 
converted from logical input schema. Please file an issue or send the PR");

Review Comment:
   Nice!
   
   Did you consider making this function return an `internal_error` rather than 
debug_assert ? 
   
   If we are concerned about breaking existing tests, we could add a config 
setting like `datafusion.optimizer.skip_failed_rules` to let users bypass the 
check



##########
datafusion/physical-expr/src/window/built_in.rs:
##########
@@ -97,6 +97,10 @@ impl BuiltInWindowExpr {
 }
 
 impl WindowExpr for BuiltInWindowExpr {
+    fn func_name(&self) -> Result<&str> {
+        not_impl_err!("function name not determined")

Review Comment:
   why wouldn't we implement func_name for a built in window function 🤔 



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -328,10 +328,45 @@ impl ExprSchemable for Expr {
                     Ok(true)
                 }
             }
+            Expr::WindowFunction(WindowFunction { fun, .. }) => {
+                match fun {
+                    WindowFunctionDefinition::BuiltInWindowFunction(func) => {
+                        if func.name() == "ROW_NUMBER"
+                            || func.name() == "RANK"
+                            || func.name() == "NTILE"
+                            || func.name() == "CUME_DIST"
+                        {
+                            Ok(false)
+                        } else {
+                            Ok(true)
+                        }
+                    }
+                    WindowFunctionDefinition::AggregateUDF(func) => {
+                        // TODO: UDF should be able to customize nullability
+                        if func.name() == "count" {
+                            // TODO: there is issue unsolved for count with 
window, should return false

Review Comment:
   
   Perhaps we can file a ticket to track this -- ideally it would eventually be 
part of the window function definition itself rather than relying on names



##########
datafusion/physical-expr/src/window/aggregate.rs:
##########
@@ -80,6 +80,14 @@ impl WindowExpr for PlainAggregateWindowExpr {
     }
 
     fn field(&self) -> Result<Field> {
+        // TODO: Fix window function to always return non-null for count

Review Comment:
   I don't understand this comment -- can we please file a ticket to track it 
(and add the ticket reference to the comments)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix the schema mismatch between logical and physical for aggregate function [datafusion]

Reply via email to