Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

via GitHub Thu, 05 Feb 2026 05:15:49 -0800


kosiew commented on PR #1347:
URL: 
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3853586582


   @timsaucer,
   
   Thanks for your suggestions. 
   
   I agree on both points  and have refactored the implementation to align more 
closely with your approach, while also taking care to address the nanoarrow 
concern in a clear and safe way.
   
   ### Implementation Details
   
   1. **Direct scalar handling**
      When the user returns a PyArrow scalar, we simply use it as-is via the 
existing `PyScalarValue` extraction—no extra work required.
   
   2. **Fallback to `py_obj_to_scalar_value`**
      For anything that isn’t already a scalar (native Python values, arrays, 
etc.), we route through `py_obj_to_scalar_value`, which now cleanly handles the 
conversion.
   
   3. **Extended `py_obj_to_scalar_value`**
      This function now:
   
      * Checks whether the object is already a `pyarrow.Scalar` using an 
explicit `isinstance()` check and extracts it directly.
      * Detects `pyarrow.Array` or `pyarrow.ChunkedArray` (also via 
`isinstance()`) and converts them into `ScalarValue::List` using the Arrow C 
data interface—**no `to_pylist()` calls involved**.
      * Falls back to the original behavior for native Python values (ints, 
floats, strings, etc.), converting them via `pyarrow.scalar()`.
   
   4. **Applied consistently to `state()` and `evaluate()`**
      Both methods now share this unified conversion path, ensuring consistent 
and predictable behavior.
   
   ### Why `isinstance()` instead of `__arrow_c_array__`?
   
   I intentionally avoided checking the `__arrow_c_array__` protocol and opted 
for explicit `isinstance()` checks against `pyarrow.Scalar`, `pyarrow.Array`, 
and `pyarrow.ChunkedArray`. This keeps things clear and robust:
   
   * Scalar objects from libraries like nanoarrow that implement 
`__arrow_c_array__` are still correctly treated as scalars, rather than being 
misclassified as lists.
   * Only true PyArrow array types are converted into `ScalarValue::List`.
   * The type checks remain explicit, readable, and safe.
   
   ### Benefits 
   
   * **No performance penalty**: Avoids `to_pylist()` entirely by relying on 
the Arrow C data interface.
   * **Flexible**: Users can return native Python values, PyArrow scalars, or 
PyArrow arrays—everything is handled gracefully.
   * **Consistent**: Both `state()` and `evaluate()` now follow the same 
conversion logic.
   * **Safe**: Clear type discrimination prevents nanoarrow scalars from being 
misclassified.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

Reply via email to