Hi,

I have recently been contributing to DataFusion, and I would like to bring to your attention a question I faced while working on some PRs that IMO needs some alignment :)
DataFusion supports scalar UDFs: functions that expect a type, return a type, and perform some operation on the data (a la Spark UDFs). However, the execution engine is actually dynamically typed:

* a scalar UDF receives an &[ArrayRef] that must be downcasted accordingly
* a scalar UDF must select the builder that matches its signature, so that its return type matches the ArrayRef that it returns

This suggests that we can treat functions as polymorphic: as long as the function handles the different types (e.g. via match), we are good. We currently do not support multiple input types nor variable return types in function signatures. Our current (non-UDF) scalar and aggregate functions are already polymorphic on both their input and return types: sum(i32) -> i64, sum(f64) -> f64, "a + b".

I have been working on PRs to add polymorphism to scalar UDFs (e.g. sqrt() can take float32 and float64) [1,3], as well as to aggregate UDFs [2], so that we can extend our offering to more interesting functions such as "length(t) -> uint", "array(c1, c2)", "collect_list(t) -> array(t)", etc.

However, while working on [1,2,3], I reached some non-trivial findings that I would like to share:

Finding 1: to support polymorphic functions, our logical and physical expressions (Expr and PhysicalExpr) need to be polymorphic as well: once a function is polymorphic, any expression containing it is also polymorphic.

Finding 2: when a polymorphic expression passes through our type-coercion optimizer (which tries to coerce types to match a function's signature), it may be re-cast to a different type. If the return type changes, the optimizer may need to re-cast operations that depend on the function call (e.g. a projection followed by an aggregation may need a re-cast on both the projection and the aggregation).

Finding 3: when an expression passes through our type-coercion optimizer and is re-cast, its name changes (typically from "expr" to "CAST(expr as X)").
This implies that a column referenced as #expr further down the plan may not exist, depending on the input type of the initial projection/scan.

Findings 1 and 2 are IMO a direct consequence of polymorphism, and the only way to avoid handling them is to not support polymorphism (e.g. the user registers sqrt_f32, sqrt_f64, etc.). Finding 3 can be addressed in at least three ways:

A) make the optimizer rewrite the expression as "CAST(expr as X) AS expr", so that it retains its original name. This hides the actual expression's calculation, but preserves its original name.

B) accept that expressions can change their names, which means that the user should be mindful when writing `SELECT sqrt(x) FROM t`, as the resulting column may end up being named `"sqrt(CAST(x as X))"`.

C) do not support polymorphic functions.

Note that we already experience effects 1-3 today; it is just that we use so few polymorphic functions that they seldom present themselves. It was while working on [1,2,3] that I started to see the bigger picture.

Some questions:

1. should we continue down the path of polymorphic functions?
2. if yes, how do we handle Finding 3?

Looking at the current code base, I am confident that we can address the technical issues needed to support polymorphic functions. However, it would be interesting to have your thoughts on this.

[1] https://github.com/apache/arrow/pull/7967
[2] https://github.com/apache/arrow/pull/7971
[3] https://github.com/apache/arrow/pull/7974
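P.S. For concreteness, here is a minimal, stdlib-only Rust sketch of the match-based polymorphic dispatch I have in mind. The `Array` enum and `sqrt_udf` are hypothetical stand-ins for illustration only; the real implementation operates on Arrow's &[ArrayRef] and downcasts via as_any().

```rust
// Hypothetical stand-in for Arrow's dynamically typed arrays; the real
// engine passes &[ArrayRef] and downcasts with as_any().downcast_ref().
#[derive(Debug, PartialEq)]
enum Array {
    Float32(Vec<f32>),
    Float64(Vec<f64>),
}

// One registered polymorphic sqrt: dispatch happens via match, and the
// return type follows the input type. This is exactly what forces
// Expr / PhysicalExpr to become polymorphic too (Finding 1).
fn sqrt_udf(input: &Array) -> Array {
    match input {
        Array::Float32(v) => Array::Float32(v.iter().map(|x| x.sqrt()).collect()),
        Array::Float64(v) => Array::Float64(v.iter().map(|x| x.sqrt()).collect()),
    }
}

fn main() {
    // The same function handles both types without separate registrations.
    assert_eq!(
        sqrt_udf(&Array::Float32(vec![4.0, 9.0])),
        Array::Float32(vec![2.0, 3.0])
    );
    assert_eq!(
        sqrt_udf(&Array::Float64(vec![16.0])),
        Array::Float64(vec![4.0])
    );
    println!("ok");
}
```

The alternative (option C above) would be registering sqrt_f32 and sqrt_f64 as separate UDFs, which keeps expressions monomorphic at the cost of a combinatorial registration burden.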