[PR] [SPARK-54617] Enable Arrow Grouped Iter Aggregate UDF registration for SQL [spark]

via GitHub Fri, 05 Dec 2025 15:11:27 -0800


Yicong-Huang opened a new pull request, #53357:
URL: https://github.com/apache/spark/pull/53357


   ### What changes were proposed in this pull request?
   
   This PR enables Arrow grouped iter aggregate UDFs to be registered and used 
in SQL queries. Previously, Arrow iter aggregate UDFs could be used only via 
the DataFrame API, not in SQL.
   
   The main change adds `SQL_GROUPED_AGG_ARROW_ITER_UDF` to the allowed 
eval types in the `UDFRegistration.register()` method, along with 
accompanying test cases.
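
The effect of widening the allowed eval types can be pictured with a small stand-alone model. This is only a sketch, not the Spark source: the `register` function, the `ALLOWED_EVAL_TYPES` set, the plain-string eval types, and the use of `TypeError` are hypothetical stand-ins for `UDFRegistration.register()`, PySpark's eval type constants, and `PySparkTypeError`.

```python
# Hypothetical model of an eval-type gate; names mirror the PR text, not Spark internals.
# The real allowed list is longer; only the names quoted in this PR are shown.
ALLOWED_EVAL_TYPES = {
    "SQL_BATCHED_UDF",
    "SQL_GROUPED_AGG_ARROW_UDF",
    "SQL_GROUPED_AGG_ARROW_ITER_UDF",  # the entry this PR adds
}


def register(name, eval_type, registry):
    """Record a UDF under `name` if its eval type is allowed, otherwise raise."""
    if eval_type not in ALLOWED_EVAL_TYPES:
        # Stand-in for the PySparkTypeError described in the PR.
        raise TypeError(f"Eval type {eval_type} is not allowed for registration")
    registry[name] = eval_type
    return name


registry = {}
register("arrow_mean_iter", "SQL_GROUPED_AGG_ARROW_ITER_UDF", registry)
```

In this model, removing the last entry from the set reproduces the earlier behavior: the same `register` call raises instead of storing the UDF.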
   
   ### Why are the changes needed?
   
   Arrow iter aggregate UDFs provide a memory-efficient way to perform grouped 
aggregations by processing data iteratively, one batch at a time. However, they 
could be used only via the DataFrame API, not in SQL queries. This limitation 
prevented users from using these UDFs in SQL-based workflows.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Users can now register Arrow grouped iter aggregate UDFs and use them 
in SQL queries. Previously, attempting to register such a UDF raised a 
`PySparkTypeError` with the message "Eval type for UDF must be SQL_BATCHED_UDF, ... 
or SQL_GROUPED_AGG_ARROW_UDF", and the UDF could be used only via the DataFrame API.
   
   Example:
   ```python
   from typing import Iterator
   from pyspark.sql.functions import arrow_udf
   import pyarrow as pa
   import pyarrow.compute as pc  # import explicitly so the compute functions are loaded
   
   @arrow_udf("double")
   def arrow_mean_iter(it: Iterator[pa.Array]) -> float:
       # Keep only a running sum and count while consuming the batches.
       sum_val = 0.0
       cnt = 0
       for v in it:
           sum_val += pc.sum(v).as_py()
           cnt += len(v)
       return sum_val / cnt if cnt > 0 else 0.0
   
   # Now this works:
   spark.udf.register("arrow_mean_iter", arrow_mean_iter)
   spark.sql("SELECT id, arrow_mean_iter(v) as mean FROM test_table GROUP BY id").show()
   ```
   
   ### How was this patch tested?
   
   Added comprehensive test cases covering:
   - Single column Arrow iter aggregate UDF in SQL
   - Multiple columns Arrow iter aggregate UDF in SQL
   - Registering Arrow iter aggregate UDF and using it in SQL
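
For the multiple-column case, the aggregate consumes more than one column per batch. The sketch below uses plain Python lists as stand-ins for Arrow arrays so that it runs without Spark or PyArrow. The function name `weighted_mean_iter`, the sample data, and the assumption that each item is a `(values, weights)` pair are illustrative and are not taken from the PR's tests.

```python
from typing import Iterator, List, Tuple


def weighted_mean_iter(batches: Iterator[Tuple[List[float], List[float]]]) -> float:
    """Weighted mean over batches, keeping only two running totals in memory."""
    weighted_sum = 0.0
    weight_total = 0.0
    for values, weights in batches:
        for v, w in zip(values, weights):
            weighted_sum += v * w
            weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Two batches belonging to one group (made-up data).
group = iter([([1.0, 2.0], [1.0, 1.0]), ([3.0], [2.0])])
result = weighted_mean_iter(group)
```

As in the single-column example, no batch is retained after it has been folded into the running totals, which is where the memory saving comes from.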
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

