In my opinion, I suggest we do not continue down the path of (runtime)
polymorphic functions unless a compelling use case for them can be
articulated.

You have done a great job articulating some of the implementation
challenges, but I personally struggle to describe when, as a user of
DataFusion, I would want to write a (runtime) polymorphic function.

A function with runtime polymorphism I think would mean the UDF could
handle the type changing *at runtime*: record batches could come in with
multiple different types during the same execution. I can't think of
examples where this behavior would be desirable or necessary.

The existing DataFusion codebase seems to assume (reasonably in my opinion)
that the schema of each Logical / Physical plan node is known at planning
time and it does not change at runtime.

Most query optimizers (and compilers for that matter) take advantage of
plan (compile) time type information to make runtime more efficient.  Also,
it seems like other database / runtime systems such as mysql[1] and
postgres[2] require the UDF creator to explicitly specify the return type
as well. I think we should consider the simpler semantics of "1 return type
for each UDF" to make it easier on people writing UDFs as well as
simplifying the implementation of DataFusion itself.

Andrew

[1] https://dev.mysql.com/doc/refman/8.0/en/create-function-udf.html
[2] https://www.postgresql.org/docs/12/sql-createfunction.html

On Mon, Aug 17, 2020 at 12:31 PM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Recently, I have been contributing to DataFusion, and I would like to bring
> to your attention a question that I faced while PRing to DataFusion that
> IMO needs some alignment :)
>
> DataFusion supports scalar UDFs: functions that expect a type, return a
> type, and performs some operation on the data (a-la spark UDF). However,
> the execution engine is actually dynamically typed:
>
> * a scalar UDF receives an &[ArrayRef] that must be downcasted accordingly
> * a scalar UDF must select the builder that matches its signature, so that
> its return type matches the ArrayRef that it returns.
>
> This suggests that we can treat functions as polymorphic: as long as the
> function handles the different types (e.g. via match), we are good. We
> currently do not support multiple input types nor variable return types in
> their function signatures.
>
> Our current (non-udf) scalar and aggregate functions are already
> polymorphic on both their input and return type: sum(i32) -> i64, sum(f64)
> -> f64, "a + b". I have been working on PRs to support polymorphic support
> to scalar UDFs (e.g. sqrt() can take float32 and float64) [1,3], as well as
> polymorphic aggregate UDFs [2], so that we can extend our offering to more
> interesting functions such as "length(t) -> uint", "array(c1, c2)",
> "collect_list(t) -> array(t)", etc.
>
> However, while working on [1,2,3], I reach some non-trivial findings that I
> would like to share:
>
> Finding 1: to support polymorphic functions, our logical and physical
> expressions (Expr and PhysicalExpr) need to be polymorphic as-well: once a
> function is polymorphic, any expression containing it is also polymorphic.
>
> Finding 2: when a polymorphic expression passes through our type coercer
> optimizer (that tries to coerce types to match a function's signature), it
> may be re-casted to a different type. If the return type changes, the
> optimizer may need to re-cast operations dependent of the function call
> (e.g. a projection followed by an aggregation may need a recast on the
> projection and on the aggregation).
>
> Finding 3: when an expression passes through our type coercer optimizer and
> is re-casted, its name changes (typically from "expr" to "CAST(expr as
> X)"). This implies that a column referenced as #expr down the plan may not
> exist depending on the input type of the initial projection/scan.
>
> Finding 1 and 2 IMO are a direct consequence of polymorphism and the only
> way to not handle them is by not supporting polymorphism (e.g. the user
> registers sqrt_f32 and sqrt_f64, etc).
>
> Finding 3 can be addressed in at least three ways:
>
> A) make the optimizer rewrite the expression as "CAST(expr as X) AS expr",
> so that it retains its original name. This hides the actual expression's
> calculation, but preserves its original name.
> B) accept that expressions can always change its name, which means that the
> user should be mindful when writing `col("SELECT sqrt(x) FROM t"`, as the
> column name may end up being called `"sqrt(CAST(x as X))"`.
> C) Do not support polymorphic functions
>
> Note that we currently already experience effects 1-3, it is just that we
> use so few polymorphic functions that these seldomly present themselves. It
> was while working on [1,2,3] that I start painting the bigger picture.
>
> Some questions:
> 1. should continue down the path of polymorphic functions?
> 2. if yes, how do handle finding 3?
>
> Looking at the current code base, I am confident that we can address the
> technical issues to support polymorphic functions. However, it would be
> interesting to have your thoughts on this.
>
> [1] https://github.com/apache/arrow/pull/7967
> [2] https://github.com/apache/arrow/pull/7971
> [3] https://github.com/apache/arrow/pull/7974
>

Reply via email to