Yicong-Huang opened a new pull request, #53493:
URL: https://github.com/apache/spark/pull/53493
### What changes were proposed in this pull request?
This PR adds `SQL_GROUPED_AGG_PANDAS_ITER_UDF` to the list of supported eval
types in the `UDFRegistration.register()` method, allowing users to register
Pandas Grouped Iter Aggregate UDFs for SQL usage.
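
For readers unfamiliar with the registration path, the change amounts to extending an allow-list of eval types that is consulted when a UDF is registered. The snippet below is a minimal, Spark-free sketch of that idea and not the actual PySpark implementation: `EvalType`, `REGISTRABLE`, `register`, and `SOME_UNSUPPORTED_UDF` are hypothetical stand-ins, and only the two `SQL_GROUPED_AGG_*_ITER_UDF` names come from this description.

```python
from enum import Enum, auto
from typing import Dict


class EvalType(Enum):
    """Hypothetical stand-in for PySpark's eval type constants."""

    SQL_GROUPED_AGG_ARROW_ITER_UDF = auto()
    SQL_GROUPED_AGG_PANDAS_ITER_UDF = auto()
    SOME_UNSUPPORTED_UDF = auto()  # placeholder for any non-registrable type


# Allow-list checked at registration time (assumed shape). The change in
# this PR corresponds to adding the PANDAS_ITER entry.
REGISTRABLE = {
    EvalType.SQL_GROUPED_AGG_ARROW_ITER_UDF,
    EvalType.SQL_GROUPED_AGG_PANDAS_ITER_UDF,
}


def register(name: str, eval_type: EvalType, registry: Dict[str, EvalType]) -> str:
    """Record the UDF under `name` if its eval type is on the allow-list."""
    if eval_type not in REGISTRABLE:
        raise TypeError(f"{eval_type.name} cannot be registered for SQL usage")
    registry[name] = eval_type
    return name


registry: Dict[str, EvalType] = {}
register("sum_iter_udf", EvalType.SQL_GROUPED_AGG_PANDAS_ITER_UDF, registry)
```

Before an eval type is on the list, the same call would fail with the `TypeError` branch, which is the behavior users hit prior to this change in this simplified model.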
### Why are the changes needed?
Currently, the iterator API for grouped aggregate Pandas UDFs cannot be
registered for SQL usage via `spark.udf.register()`. This is inconsistent with
other UDF types such as `SQL_GROUPED_AGG_ARROW_ITER_UDF`, which is already
supported.
With this change, users can now register iterator-based grouped aggregate
UDFs and use them in SQL queries:
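```python
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def sum_iter_udf(it: Iterator[pd.Series]) -> float:
    total = 0.0
    for series in it:
        total += series.sum()
    return total

# Assumes an active SparkSession named `spark`.
spark.udf.register("sum_iter_udf", sum_iter_udf)
spark.sql("SELECT sum_iter_udf(v) FROM table GROUP BY id")
```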
### Does this PR introduce _any_ user-facing change?
Yes. Users can now register Pandas Grouped Iter Aggregate UDFs
(`Iterator[pd.Series] -> scalar`) for SQL usage.
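
The `Iterator[pd.Series] -> scalar` shape means the function receives one group's values in batches and has to carry its own running state across them. The following is a minimal, Spark-free sketch of that contract: plain lists of floats stand in for the `pd.Series` batches, and `mean_iter` and the sample batches are hypothetical, not taken from this PR.

```python
from typing import Iterator, List


def mean_iter(batches: Iterator[List[float]]) -> float:
    """Average one group's values, consuming them batch by batch."""
    total = 0.0
    count = 0
    for batch in batches:
        # Keep running state; no single batch holds the whole group.
        total += sum(batch)
        count += len(batch)
    # An empty group has no mean; NaN is an arbitrary choice for this sketch.
    return total / count if count else float("nan")


# One group's values, split into three batches of uneven size.
group_batches = [[1.0, 2.0], [3.0], [4.0, 6.0]]
result = mean_iter(iter(group_batches))
print(result)
```

Averaging per-batch means would give a different answer here because the batches differ in size, which is why the sketch tracks the sum and the count separately.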
### How was this patch tested?
Added a new test case `test_register_grouped_agg_iter_udf` in
`python/pyspark/sql/tests/pandas/test_pandas_udf_grouped_agg.py`.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]