alamb commented on code in PR #7752:
URL: https://github.com/apache/arrow-datafusion/pull/7752#discussion_r1349511515
##########
datafusion-examples/Cargo.toml:
##########
@@ -41,6 +41,7 @@ datafusion-common = { path = "../datafusion/common" }
datafusion-expr = { path = "../datafusion/expr" }
datafusion-optimizer = { path = "../datafusion/optimizer" }
datafusion-sql = { path = "../datafusion/sql" }
+datafusion-extension-test-scalar-func = {path =
"../extension/scalar-function/test-func"}
Review Comment:
Since most of DataFusion is extensions, another potential might be a path
structure like the following?
```
├── core # (already exists)
├── functions-aggregate-core
├── functions-aggregate-statistics
├── functions-scalar-array
├── functions-scalar-core
├── functions-scalar-crypto
├── functions-scalar-timestamp
├── physical-expr # (already exists)
...
└── physical-plan # (already exists)
```
##########
datafusion/expr/src/udf.rs:
##########
@@ -18,11 +18,30 @@
//! Udf module contains foundational types that are used to represent UDFs in
DataFusion.
use crate::{Expr, ReturnTypeFunction, ScalarFunctionImplementation, Signature};
+use arrow::array::ArrayRef;
+use datafusion_common::Result;
use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;
+pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
+ // TODO: support alias
+ fn name(&self) -> &str;
+
+ fn signature(&self) -> Signature;
+
+ // TODO: ReturnTypeFunction -> a ENUM
+ // most function's return type is either the same as 1st arg or a
fixed type
+ fn return_type(&self) -> ReturnTypeFunction;
Review Comment:
I think the current API requires return_type to be passed the specific
argument types
Also, https://github.com/apache/arrow-datafusion/discussions/7657 suggests
there is additional room for improvement
##########
datafusion-examples/Cargo.toml:
##########
@@ -41,6 +41,7 @@ datafusion-common = { path = "../datafusion/common" }
datafusion-expr = { path = "../datafusion/expr" }
datafusion-optimizer = { path = "../datafusion/optimizer" }
datafusion-sql = { path = "../datafusion/sql" }
+datafusion-extension-test-scalar-func = {path =
"../extension/scalar-function/test-func"}
Review Comment:
Since most of DataFusion is extensions, another potential might be a path
structure like the following?
```
├── core # (already exists)
├── functions-aggregate-core
├── functions-aggregate-statistics
├── functions-scalar-array
├── functions-scalar-core
├── functions-scalar-crypto
├── functions-scalar-timestamp
├── physical-expr # (already exists)
...
└── physical-plan # (already exists)
```
##########
datafusion/expr/src/udf.rs:
##########
@@ -18,11 +18,30 @@
//! Udf module contains foundational types that are used to represent UDFs in
DataFusion.
use crate::{Expr, ReturnTypeFunction, ScalarFunctionImplementation, Signature};
+use arrow::array::ArrayRef;
+use datafusion_common::Result;
use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;
+pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
Review Comment:
I really like using a trait to define functions. To be honest, I am not sure
why DataFusion didn't so that originally.
My only concern is that adding this we would now have *three* ways to define
scalar functions (the ScalarUDF, `ScalarFunctionDef` and
`BuiltInScalarFunction`).
I wonder if this trait is needed, or can we extend `ScalarUDF` to account
for aliases?
Or perhaps should we be aiming to consolidate *all* functions to use
`ScalarFuntionDef` 🤔
##########
datafusion/core/src/execution/context.rs:
##########
@@ -792,6 +794,30 @@ impl SessionContext {
.add_var_provider(variable_type, provider);
}
+ /// Register a function package into this context
+ pub fn register_scalar_function_package(
Review Comment:
If we are going to implement a new API, maybe we can have it cover all types
of functions (window, aggregate, scalar) so there is a place to add table
functions eventually too:
Something like
```
pub fn register_function_package(&self, func_pkg: Box<dyn
FunctionPackage>)
```
Also I wonder if it should take `Box<dyn FunctionPackage>` (which is owned)
or `Arc<dyn FunctionPackage>` which can be shared. Looking at this code, it
seems like there is no reason for an owned pointer and we could get away with
`Arc` (so function packages can be quickly registered with multiple
`SessionContext`s)
##########
datafusion/expr/src/udf.rs:
##########
@@ -18,11 +18,30 @@
//! Udf module contains foundational types that are used to represent UDFs in
DataFusion.
use crate::{Expr, ReturnTypeFunction, ScalarFunctionImplementation, Signature};
+use arrow::array::ArrayRef;
+use datafusion_common::Result;
use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;
+pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
+ // TODO: support alias
+ fn name(&self) -> &str;
+
+ fn signature(&self) -> Signature;
+
+ // TODO: ReturnTypeFunction -> a ENUM
+ // most function's return type is either the same as 1st arg or a
fixed type
+ fn return_type(&self) -> ReturnTypeFunction;
+
+ fn execute(&self, args: &[ArrayRef]) -> Result<ArrayRef>;
+}
+
+pub trait ScalarFunctionPackage {
+ fn functions(&self) -> Vec<Box<dyn ScalarFunctionDef>>;
Review Comment:
I suggest using `Arc` here as I don't see any reason for the to be state /
need an owned copy
##########
datafusion/expr/src/udf.rs:
##########
@@ -18,11 +18,30 @@
//! Udf module contains foundational types that are used to represent UDFs in
DataFusion.
use crate::{Expr, ReturnTypeFunction, ScalarFunctionImplementation, Signature};
+use arrow::array::ArrayRef;
+use datafusion_common::Result;
use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;
+pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
Review Comment:
I really like using a trait to define functions. To be honest, I am not sure
why DataFusion didn't so that originally.
My only concern is that adding this we would now have *three* ways to define
scalar functions (the ScalarUDF, `ScalarFunctionDef` and
`BuiltInScalarFunction`).
I wonder if this trait is needed, or can we extend `ScalarUDF` to account
for aliases?
Or perhaps should we be aiming to consolidate *all* functions to use
`ScalarFuntionDef` 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]