Hi,

Recently I have been contributing to DataFusion, and while preparing PRs I
ran into a question that IMO needs some alignment :)

DataFusion supports scalar UDFs: functions that take typed input, return a
typed output, and perform some operation on the data (à la Spark UDFs).
However, the execution engine is actually dynamically typed:

* a scalar UDF receives an &[ArrayRef] that must be downcast accordingly
* a scalar UDF must select the builder that matches its signature, so that
its return type matches the ArrayRef that it returns.
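As a toy illustration of that contract (using a hand-rolled enum and
hypothetical names in place of Arrow's actual ArrayRef/downcast_ref/builder
API), a monomorphic scalar UDF looks roughly like:

```rust
// Toy stand-in for Arrow's dynamically typed arrays; real code would
// receive an &[ArrayRef] and downcast via as_any().downcast_ref().
#[derive(Debug, PartialEq)]
enum ToyArray {
    Float64(Vec<f64>),
    Int32(Vec<i32>),
}

// A monomorphic scalar UDF: it only accepts Float64 input, and must
// "downcast" (here, match) before it can touch the values.
fn sqrt_f64(input: &ToyArray) -> Result<ToyArray, String> {
    match input {
        ToyArray::Float64(values) => {
            // The function also picks its output representation, which
            // must agree with the return type it declared.
            Ok(ToyArray::Float64(values.iter().map(|v| v.sqrt()).collect()))
        }
        other => Err(format!("sqrt_f64 expects Float64, got {:?}", other)),
    }
}

fn main() {
    let arr = ToyArray::Float64(vec![1.0, 4.0, 9.0]);
    println!("{:?}", sqrt_f64(&arr).unwrap());
}
```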

This suggests that we can treat functions as polymorphic: as long as the
function handles the different types (e.g. via match), we are good.
However, we currently support neither multiple input types nor variable
return types in UDF signatures.

Our current (non-UDF) scalar and aggregate functions are already
polymorphic on both their input and return types: sum(i32) -> i64,
sum(f64) -> f64, "a + b". I have been working on PRs to add polymorphic
support to scalar UDFs (e.g. sqrt() can take float32 and float64) [1,3],
as well as to aggregate UDFs [2], so that we can extend our offering to
more interesting functions such as "length(t) -> uint", "array(c1, c2)",
"collect_list(t) -> array(t)", etc.
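Under the same toy model (again, hypothetical names, not DataFusion's
actual API), a polymorphic sqrt would simply match on more variants and
let the input type select both the computation and the return type:

```rust
// Toy stand-in for Arrow arrays, as before.
#[derive(Debug, PartialEq)]
enum ToyArray {
    Float32(Vec<f32>),
    Float64(Vec<f64>),
}

// A polymorphic sqrt: f32 input yields f32 output, f64 yields f64,
// mirroring how a UDF could match on the concrete array type behind
// an &[ArrayRef] and pick the matching builder.
fn sqrt_poly(input: &ToyArray) -> ToyArray {
    match input {
        ToyArray::Float32(v) => ToyArray::Float32(v.iter().map(|x| x.sqrt()).collect()),
        ToyArray::Float64(v) => ToyArray::Float64(v.iter().map(|x| x.sqrt()).collect()),
    }
}

fn main() {
    println!("{:?}", sqrt_poly(&ToyArray::Float32(vec![4.0])));
    println!("{:?}", sqrt_poly(&ToyArray::Float64(vec![9.0])));
}
```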

However, while working on [1,2,3], I reached some non-trivial findings
that I would like to share:

Finding 1: to support polymorphic functions, our logical and physical
expressions (Expr and PhysicalExpr) need to be polymorphic as well: once a
function is polymorphic, any expression containing it is also polymorphic.

Finding 2: when a polymorphic expression passes through our type-coercion
optimizer (which tries to coerce types to match a function's signature), it
may be re-cast to a different type. If the return type changes, the
optimizer may need to re-cast operations dependent on the function call
(e.g. a projection followed by an aggregation may need a re-cast on both
the projection and the aggregation).
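A minimal sketch of why the cascade happens (toy Expr and DataType, not
DataFusion's; the coerce/return_type names here are made up for
illustration): coercing sqrt's Int32 argument to Float64 changes sqrt's
own return type, so anything downstream that assumed the old type must be
re-cast too.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType { Int32, Float32, Float64 }

#[derive(Debug)]
enum Expr {
    Column(&'static str, DataType),
    Cast(Box<Expr>, DataType),
    // Polymorphic: sqrt's return type follows its input type.
    Sqrt(Box<Expr>),
}

// Compute an expression's return type; note how Sqrt propagates its
// input's type, which is what makes the whole expression polymorphic.
fn return_type(e: &Expr) -> DataType {
    match e {
        Expr::Column(_, t) => *t,
        Expr::Cast(_, t) => *t,
        Expr::Sqrt(inner) => return_type(inner),
    }
}

// Toy coercion pass: sqrt only accepts floats, so an Int32 argument is
// wrapped in CAST(.. AS Float64) -- changing sqrt's return type.
fn coerce(e: Expr) -> Expr {
    match e {
        Expr::Sqrt(inner) => {
            let inner = coerce(*inner);
            match return_type(&inner) {
                DataType::Float32 | DataType::Float64 => Expr::Sqrt(Box::new(inner)),
                DataType::Int32 => Expr::Sqrt(Box::new(
                    Expr::Cast(Box::new(inner), DataType::Float64),
                )),
            }
        }
        other => other,
    }
}

fn main() {
    let plan = Expr::Sqrt(Box::new(Expr::Column("x", DataType::Int32)));
    let coerced = coerce(plan);
    // After coercion the expression yields Float64, not Int32.
    println!("{:?}", return_type(&coerced));
}
```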

Finding 3: when an expression passes through the type-coercion optimizer
and is re-cast, its name changes (typically from "expr" to "CAST(expr AS
X)"). This implies that a column referenced as #expr further down the plan
may not exist, depending on the input type of the initial projection/scan.

Findings 1 and 2 are IMO a direct consequence of polymorphism, and the
only way to avoid handling them is to not support polymorphism (e.g. the
user registers sqrt_f32, sqrt_f64, etc.).

Finding 3 can be addressed in at least three ways:

A) make the optimizer rewrite the expression as "CAST(expr AS X) AS expr",
so that it retains its original name. This hides the actual expression's
calculation, but preserves its original name.
B) accept that expressions can change their names, which means the user
must be mindful when referencing result columns: after `SELECT sqrt(x)
FROM t`, the column may end up being named `"sqrt(CAST(x AS X))"`.
C) Do not support polymorphic functions
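To make option A concrete, here is a tiny sketch of the naming rewrite
(the helper names are mine, purely for illustration): the alias restores
the pre-coercion name, so references to #expr downstream still resolve.

```rust
// Option B behavior: the coerced expression's display name changes,
// e.g. "x" becomes "CAST(x AS Float64)".
fn cast_name(original: &str, target: &str) -> String {
    format!("CAST({} AS {})", original, target)
}

// Option A behavior: wrap the cast in an alias that restores the
// original name, so downstream column references keep working.
fn aliased_cast_name(original: &str, target: &str) -> String {
    format!("{} AS {}", cast_name(original, target), original)
}

fn main() {
    println!("{}", cast_name("x", "Float64"));
    println!("{}", aliased_cast_name("x", "Float64"));
}
```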

Note that we already experience effects 1-3 today; it is just that we use
so few polymorphic functions that they seldom present themselves. It was
while working on [1,2,3] that I started to see the bigger picture.

Some questions:
1. Should we continue down the path of polymorphic functions?
2. If yes, how do we handle finding 3?

Looking at the current code base, I am confident that we can address the
technical issues to support polymorphic functions. However, it would be
interesting to have your thoughts on this.

[1] https://github.com/apache/arrow/pull/7967
[2] https://github.com/apache/arrow/pull/7971
[3] https://github.com/apache/arrow/pull/7974
