[PR] Preserve PyArrow extension metadata when chaining Python scalar UDFs [datafusion-python]

via GitHub Wed, 22 Oct 2025 03:41:49 -0700


kosiew opened a new pull request, #1287:
URL: https://github.com/apache/datafusion-python/pull/1287


   
   
   ## Which issue does this PR close?
   
   Closes #1172
   
   ---
   
   ## Rationale for this change
   
   When multiple scalar UDFs are chained in Python, the intermediate results 
lose PyArrow extension metadata.
   This happens because the existing binding only passed 
`arrow::datatypes::DataType` to Rust’s `ScalarUDF`, discarding extension 
information embedded in `pyarrow.Field`.
   
   This patch ensures that DataFusion’s Python UDF layer preserves the 
**complete field metadata**, allowing extension arrays (e.g. `arrow.uuid`, 
custom logical types) to survive round-trips between Python and Rust.
   
   ---
   
   ## What changes are included in this PR?
   
   ### 🔧 Python (`python/datafusion/user_defined.py`)
   
   * Introduced `PyArrowArray` and `PyArrowArrayT` aliases for unified typing 
of `Array` and `ChunkedArray`.
   * Added normalization utilities:
   
     * `_normalize_field`, `_normalize_input_fields`, `_normalize_return_field`
     * `_wrap_extension_value` and `_wrap_udf_function` to automatically 
re-wrap extension arrays on UDF input/output.
   * Updated `ScalarUDF` constructor and decorator overloads to accept both 
`pa.Field` and `pa.DataType` objects.
   * Ensured `ScalarUDF` passes complete `Field` objects (including metadata) 
to the internal layer.
   
   ### 🧰 Rust (`src/udf.rs`)
   
   * Added a new `PySimpleScalarUDF` implementing `ScalarUDFImpl`:
   
     * Preserves `arrow::datatypes::Field` for inputs and return values.
     * Implements `return_field_from_args` to keep field names and extension 
metadata.
   * Updated the PyO3 binding to accept and expose `Vec<Field>` instead of 
`Vec<DataType>`.
   * Refactored construction to use `ScalarUDF::new_from_impl()`.
   
   ### 🤖 Tests (`python/tests/test_udf.py`)
   
   * Added `test_uuid_extension_chain` verifying that:
   
     * Chained UDFs correctly round-trip `arrow.uuid` arrays.
     * Empty extension arrays are handled without type loss.
     * UDF input/output extension metadata remains intact.
   
   ---
   
   ## Are these changes tested?
   
   ✅ Yes.
   The new test `test_uuid_extension_chain` explicitly covers:
   
   * Chaining of UUID extension UDFs.
   * Handling of empty extension arrays.
   * Type preservation across UDF boundaries.
   
   Existing decorator and parameterized UDF tests remain intact and continue 
to pass.
   
   ---
   
   ## Are there any user-facing changes?
   
   Yes: **enhanced behavior for PyArrow extension arrays** in Python UDFs.
   
   * Users can now declare `input_types` and `return_type` as either 
`pa.DataType` or `pa.Field`.
   * Chained scalar UDFs now preserve PyArrow extension metadata (e.g. 
`arrow.uuid`, custom registered extensions).
   * Existing non-extension UDFs continue to function unchanged.
   
   No breaking API changes are introduced: the update is backward-compatible 
and only extends functionality.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


