alamb commented on code in PR #11989: URL: https://github.com/apache/datafusion/pull/11989#discussion_r1719999132
########## datafusion/expr/src/expr_schema.rs: ########## @@ -328,10 +328,45 @@ impl ExprSchemable for Expr { Ok(true) } } + Expr::WindowFunction(WindowFunction { fun, .. }) => { + match fun { + WindowFunctionDefinition::BuiltInWindowFunction(func) => { + if func.name() == "ROW_NUMBER" + || func.name() == "RANK" + || func.name() == "NTILE" + || func.name() == "CUME_DIST" + { + Ok(false) + } else { + Ok(true) + } + } + WindowFunctionDefinition::AggregateUDF(func) => { + // TODO: UDF should be able to customize nullability + if func.name() == "count" { + // TODO: there is issue unsolved for count with window, should return false + Ok(true) + } else { + Ok(true) + } + } + _ => Ok(true), + } + } + Expr::ScalarFunction(ScalarFunction { func, args }) => { + // If all the element in coalesce is non-null, the result is non-null Review Comment: We should probably add an API to ScalarUDFImpl to signal its null/non-nullness (as a follow on PR) instead of hard coding this function name ``` func.is_nullable(args) ``` ########## datafusion/functions-aggregate-common/src/aggregate.rs: ########## @@ -171,6 +171,9 @@ pub trait AggregateExpr: Send + Sync + Debug + PartialEq<dyn Any> { fn get_minmax_desc(&self) -> Option<(Field, bool)> { None } + + /// Get function's name, for example `count(x)` returns `count` + fn func_name(&self) -> &str; Review Comment: is there a reason this isn't `name()` ? `func_name` is fine, it just seems inconsistent with the rest of the code ########## datafusion/expr/src/udaf.rs: ########## @@ -196,6 +196,10 @@ impl AggregateUDF { self.inner.state_fields(args) } + pub fn fields(&self, args: StateFieldsArgs) -> Result<Field> { Review Comment: Could we document this function and what it is for (also in AggregateUdfImpl)? Also, the name is strange to me -- it is `fields` but it returns a single `Field` and the corresponding function on `AggregateUDFImpl` is called `field` (no `s`) 🤔 ########## datafusion/optimizer/src/analyzer/type_coercion.rs: ########## @@ -833,39 +841,47 @@ fn coerce_union_schema(inputs: &[Arc<LogicalPlan>]) -> Result<DFSchema> { plan_schema.fields().len() ); } - // coerce data type and nullablity for each field - for (union_datatype, union_nullable, plan_field) in izip!( - union_datatypes.iter_mut(), - union_nullabilities.iter_mut(), - plan_schema.fields() - ) { - let coerced_type = - comparison_coercion(union_datatype, plan_field.data_type()).ok_or_else( - || { - plan_datafusion_err!( - "Incompatible inputs for Union: Previous inputs were \ - of type {}, but got incompatible type {} on column '{}'", - union_datatype, - plan_field.data_type(), - plan_field.name() - ) - }, - )?; - *union_datatype = coerced_type; - *union_nullable = *union_nullable || plan_field.is_nullable(); + + // Safety: Length is checked + unsafe { Review Comment: I think this unsafe block is unecessary -- this isn't a performance critical piece of code. I think `izip` or just manuallly `zip`ping three times would be better ########## datafusion/expr/src/expr_schema.rs: ########## @@ -328,10 +328,45 @@ impl ExprSchemable for Expr { Ok(true) } } + Expr::WindowFunction(WindowFunction { fun, .. }) => { Review Comment: Is this change required for this PR or is it a "drive by" improvement? ########## datafusion/core/src/physical_planner.rs: ########## @@ -670,6 +670,10 @@ impl DefaultPhysicalPlanner { let input_exec = children.one()?; let physical_input_schema = input_exec.schema(); let logical_input_schema = input.as_ref().schema(); + let physical_input_schema_from_logical: Arc<Schema> = + logical_input_schema.as_ref().clone().into(); + + debug_assert_eq!(physical_input_schema_from_logical, physical_input_schema, "Physical input schema should be the same as the one converted from logical input schema. Please file an issue or send the PR"); Review Comment: Nice! Did you consider making this function return an `internal_error` rather than debug_assert ? If we are concerned about breaking existing tests, we could add a config setting like `datafusion.optimizer.skip_failed_rules` to let users bypass the check ########## datafusion/physical-expr/src/window/built_in.rs: ########## @@ -97,6 +97,10 @@ impl BuiltInWindowExpr { } impl WindowExpr for BuiltInWindowExpr { + fn func_name(&self) -> Result<&str> { + not_impl_err!("function name not determined") Review Comment: why wouldn't we implement func_name for a built in window function 🤔 ########## datafusion/expr/src/expr_schema.rs: ########## @@ -328,10 +328,45 @@ impl ExprSchemable for Expr { Ok(true) } } + Expr::WindowFunction(WindowFunction { fun, .. }) => { + match fun { + WindowFunctionDefinition::BuiltInWindowFunction(func) => { + if func.name() == "ROW_NUMBER" + || func.name() == "RANK" + || func.name() == "NTILE" + || func.name() == "CUME_DIST" + { + Ok(false) + } else { + Ok(true) + } + } + WindowFunctionDefinition::AggregateUDF(func) => { + // TODO: UDF should be able to customize nullability + if func.name() == "count" { + // TODO: there is issue unsolved for count with window, should return false Review Comment: Perhaps we can file a ticket to track this -- ideally it would eventually be part of the window function definition itself rather than relying on names ########## datafusion/physical-expr/src/window/aggregate.rs: ########## @@ -80,6 +80,14 @@ impl WindowExpr for PlainAggregateWindowExpr { } fn field(&self) -> Result<Field> { + // TODO: Fix window function to always return non-null for count Review Comment: I don't understand this comment -- can we please file a ticket to track it (and add the ticket reference to the comments)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org