timsaucer commented on code in PR #1287:
URL: https://github.com/apache/datafusion-python/pull/1287#discussion_r2452385320
##########
python/datafusion/user_defined.py:
##########
@@ -164,10 +270,13 @@ def udf(*args: Any, **kwargs: Any):  # noqa: D417
             backed ScalarUDF within a PyCapsule, you can pass this parameter
             and ignore the rest. They will be determined directly from the
             underlying function. See the online documentation for more
             information.
-        input_types (list[pa.DataType]): The data types of the arguments
-            to ``func``. This list must be of the same length as the number of
-            arguments.
-        return_type (_R): The data type of the return value from the function.
+            The callable should accept and return :class:`pyarrow.Array` or
+            :class:`pyarrow.ChunkedArray` values.
+        input_types (list[pa.DataType | pa.Field]): The argument types for ``func``.
+            This list must be of the same length as the number of arguments. Pass
+            :class:`pyarrow.Field` instances to preserve extension metadata.
+        return_type (pa.DataType | pa.Field): The return type of the function. Use a
+            :class:`pyarrow.Field` to preserve metadata on extension arrays.
Review Comment:
I think this comment is misleading. I do not think there is any guarantee
that the output field metadata will be preserved. Instead, this should be
documented as the way you can *set* output metadata. I think it is entirely
possible that a UDF implemented like this can still lose the metadata. One
case is where you want to consume metadata on the input side and emit some
*different* kind of metadata on the output side.
##########
python/tests/test_udf.py:
##########
@@ -124,3 +130,94 @@ def udf_with_param(values: pa.Array) -> pa.Array:
result = df2.collect()[0].column(0)
assert result == pa.array([False, True, True])
+
+
[email protected](
+ not UUID_EXTENSION_AVAILABLE,
+ reason="PyArrow uuid extension helper unavailable",
+)
Review Comment:
Can we make the uuid extension a requirement in our developer dependencies?
##########
src/udf.rs:
##########
@@ -80,6 +81,83 @@ fn to_scalar_function_impl(func: PyObject) -> ScalarFunctionImplementation {
})
}
+#[derive(PartialEq, Eq, Hash)]
+struct PySimpleScalarUDF {
+ name: String,
+ signature: Signature,
+ return_field: Arc<Field>,
+ fun: PtrEq<ScalarFunctionImplementation>,
+}
+
+impl fmt::Debug for PySimpleScalarUDF {
+ fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+ f.debug_struct("PySimpleScalarUDF")
+ .field("name", &self.name)
+ .field("signature", &self.signature)
+ .field("return_field", &self.return_field)
+ .finish()
+ }
+}
+
+impl PySimpleScalarUDF {
+ fn new(
+ name: impl Into<String>,
+ input_fields: Vec<Field>,
+ return_field: Field,
+ volatility: Volatility,
+ fun: ScalarFunctionImplementation,
+ ) -> Self {
+ let signature_types = input_fields
+ .into_iter()
+ .map(|field| field.data_type().clone())
+ .collect();
+ let signature = Signature::exact(signature_types, volatility);
+ Self {
+ name: name.into(),
+ signature,
+ return_field: Arc::new(return_field),
+ fun: fun.into(),
+ }
+ }
+}
+
+impl ScalarUDFImpl for PySimpleScalarUDF {
+ fn as_any(&self) -> &dyn std::any::Any {
+ self
+ }
+
+ fn name(&self) -> &str {
+ &self.name
+ }
+
+ fn signature(&self) -> &Signature {
+ &self.signature
+ }
+
+ fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
+ Ok(self.return_field.data_type().clone())
Review Comment:
Best practice is to not implement this method. Per the upstream DataFusion
documentation:
> If you provide an implementation for Self::return_field_from_args,
DataFusion will not call return_type (this function). In such cases it is
recommended to return DataFusionError::Internal.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]