timsaucer commented on code in PR #1287:
URL: 
https://github.com/apache/datafusion-python/pull/1287#discussion_r2452385320


##########
python/datafusion/user_defined.py:
##########
@@ -164,10 +270,13 @@ def udf(*args: Any, **kwargs: Any):  # noqa: D417
                 backed ScalarUDF within a PyCapsule, you can pass this 
parameter
                 and ignore the rest. They will be determined directly from the
                 underlying function. See the online documentation for more 
information.
-            input_types (list[pa.DataType]): The data types of the arguments
-                to ``func``. This list must be of the same length as the 
number of
-                arguments.
-            return_type (_R): The data type of the return value from the 
function.
+                The callable should accept and return :class:`pyarrow.Array` or
+                :class:`pyarrow.ChunkedArray` values.
+            input_types (list[pa.DataType | pa.Field]): The argument types for 
``func``.
+                This list must be of the same length as the number of 
arguments. Pass
+                :class:`pyarrow.Field` instances to preserve extension 
metadata.
+            return_type (pa.DataType | pa.Field): The return type of the 
function. Use a
+                :class:`pyarrow.Field` to preserve metadata on extension 
arrays.

Review Comment:
   I think this comment is misleading. I do not think there is any guarantee 
that the output field metadata will be preserved. Instead this should be the 
way in which you can set output metadata. I think it is entirely possible that 
a UDF implemented like this can still lose the metadata. One case is where you 
want to consume it on the input side and output some *different* kind of 
metadata on your output side.



##########
python/tests/test_udf.py:
##########
@@ -124,3 +130,94 @@ def udf_with_param(values: pa.Array) -> pa.Array:
     result = df2.collect()[0].column(0)
 
     assert result == pa.array([False, True, True])
+
+
[email protected](
+    not UUID_EXTENSION_AVAILABLE,
+    reason="PyArrow uuid extension helper unavailable",
+)

Review Comment:
   Can we make the uuid extension a requirement in our developer dependencies?



##########
src/udf.rs:
##########
@@ -80,6 +81,83 @@ fn to_scalar_function_impl(func: PyObject) -> 
ScalarFunctionImplementation {
     })
 }
 
+#[derive(PartialEq, Eq, Hash)]
+struct PySimpleScalarUDF {
+    name: String,
+    signature: Signature,
+    return_field: Arc<Field>,
+    fun: PtrEq<ScalarFunctionImplementation>,
+}
+
+impl fmt::Debug for PySimpleScalarUDF {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.debug_struct("PySimpleScalarUDF")
+            .field("name", &self.name)
+            .field("signature", &self.signature)
+            .field("return_field", &self.return_field)
+            .finish()
+    }
+}
+
+impl PySimpleScalarUDF {
+    fn new(
+        name: impl Into<String>,
+        input_fields: Vec<Field>,
+        return_field: Field,
+        volatility: Volatility,
+        fun: ScalarFunctionImplementation,
+    ) -> Self {
+        let signature_types = input_fields
+            .into_iter()
+            .map(|field| field.data_type().clone())
+            .collect();
+        let signature = Signature::exact(signature_types, volatility);
+        Self {
+            name: name.into(),
+            signature,
+            return_field: Arc::new(return_field),
+            fun: fun.into(),
+        }
+    }
+}
+
+impl ScalarUDFImpl for PySimpleScalarUDF {
+    fn as_any(&self) -> &dyn std::any::Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        &self.name
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> 
datafusion::error::Result<DataType> {
+        Ok(self.return_field.data_type().clone())

Review Comment:
   Best practice is to not implement this method. Per the upstream datafusion 
documentation:
   
   > If you provide an implementation for Self::return_field_from_args, 
DataFusion will not call return_type (this function). In such cases is 
recommended to return DataFusionError::Internal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to