kosiew opened a new pull request, #1347:
URL: https://github.com/apache/datafusion-python/pull/1347

   ## Which issue does this PR close?
   
   * Closes #1339.
   
   ## Rationale for this change
   
   Python UDAFs that logically return a *list* (e.g., “collect all timestamps 
for a group”) were easy to implement incorrectly by returning a `pyarrow.Array` 
(or `ChunkedArray`) directly from `Accumulator.evaluate()`. In that case, 
DataFusion’s Python ↔ Rust conversion path could attempt to treat the array 
object as an integer-like scalar, leading to confusing conversion failures such 
as:
   
   * `ArrowTypeError: object of type <class 'pyarrow.lib.TimestampArray'> 
cannot be converted to int`
   
   To make list-returning UDAFs work reliably (and make the contract explicit), 
we now ensure the UDAF result is a list-valued *scalar* that matches the 
declared Arrow return type.
   
   ## What changes are included in this PR?
   
   * **Documentation & API contract clarification**
   
     * Added a FAQ entry explaining how to return a list from a UDAF (return a 
list-valued `pa.scalar` and declare list types for `return_type` and 
`state_type`).
     * Expanded `Accumulator.evaluate()` docstring to explicitly state that 
returning a `pyarrow.Array` is not supported unless converted to a list-valued 
scalar.
   
   * **Runtime behavior improvements in the Rust accumulator bridge**
   
     * In `RustAccumulator.evaluate`, detect when the declared return type is a 
list (`List`, `LargeList`, `FixedSizeList`) and the Python `evaluate()` result 
is `pyarrow.Array`/`pyarrow.ChunkedArray`.
     * Convert such array-like values into a list-valued `pyarrow.scalar(..., 
type=<declared list type>)` before extracting the scalar value for DataFusion.
     * Cache the `pyarrow.Array` and `pyarrow.ChunkedArray` Python types inside 
the accumulator to avoid repeated `pyarrow` imports/type lookups on every 
`evaluate` call.
   
   * **Tests**
   
     * Added a regression test (`test_udaf_list_timestamp_return`) covering a 
UDAF that returns a list of `timestamp(ns)` values.
     * Added a small `CollectTimestamps` accumulator used by the test that 
maintains list state as a list-valued scalar and returns a list-valued scalar 
from `evaluate()`.
   
   ## Are these changes tested?
   
   Yes.
   
   * Added a new regression test verifying that a UDAF can return 
`list<timestamp(ns)>` and that the collected results match the expected 
list-of-timestamps output.
   
   ## Are there any user-facing changes?
   
   Yes.
   
   * **Docs:** Added guidance in the UDF/UDAF user guide on returning lists 
from UDAFs.
   * **Behavior:** If a user returns a `pyarrow.Array`/`ChunkedArray` from 
`evaluate()` for a list-typed UDAF, it is now converted into the correct 
list-valued scalar automatically (rather than failing with a type conversion 
error).
   * **No breaking API changes** are intended; this clarifies and enforces the 
existing contract that `evaluate()` must return a scalar.
   
   ## LLM-generated code disclosure
   
   This PR includes code, comments generated with assistance from LLM. All 
LLM-generated content has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to