(spark) branch master updated: [SPARK-54738][PYTHON] Add Profiler Support for Pandas Grouped Iter Aggregate UDF

ruifengz Wed, 17 Dec 2025 19:38:41 -0800

This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new a4e456318eae [SPARK-54738][PYTHON] Add Profiler Support for Pandas 
Grouped Iter Aggregate UDF
a4e456318eae is described below

commit a4e456318eae32c16f533d155e6c0f2d7c925818
Author: Yicong-Huang <[email protected]>
AuthorDate: Thu Dec 18 11:37:03 2025 +0800

    [SPARK-54738][PYTHON] Add Profiler Support for Pandas Grouped Iter 
Aggregate UDF
    
    ### What changes were proposed in this pull request?
    
    - `spark.python.profile`: Add `SQL_GROUPED_AGG_PANDAS_ITER_UDF` to the 
profiler warning list in `udf.py` so that when `spark.python.profile` is 
enabled, users will see appropriate warnings consistent with other 
iterator-based UDFs.
    - `spark.sql.pyspark.udf.profiler`: No changes needed. This UDF type 
already works correctly because it returns scalar (not iterator), so it uses 
the non-iterator profiler branch in `wrap_perf_profiler` and 
`wrap_memory_profiler`.
    
    ### Why are the changes needed?
    
    To make profilers support for `SQL_GROUPED_AGG_PANDAS_ITER_UDF` consistent 
with other UDFs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. When users enable `spark.python.profile` with 
`SQL_GROUPED_AGG_PANDAS_ITER_UDF`, they will now see a warning message 
consistent with other iterator-based UDFs.
    
    ### How was this patch tested?
    
    Added a test case `test_perf_profiler_pandas_udf_grouped_agg_iter` to 
verify that `spark.sql.pyspark.udf.profiler` works correctly with this UDF 
type. Also verified that the `spark.python.profile` profiler warning is 
triggered correctly in `test_unsupported`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes #53511 from 
Yicong-Huang/SPARK-54738/feat/add-profiler-support-for-grouped-iter-agg-udf.
    
    Authored-by: Yicong-Huang <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 python/pyspark/sql/tests/test_udf_profiler.py | 27 +++++++++++++++++++++++++++
 python/pyspark/sql/udf.py                     |  1 +
 2 files changed, 28 insertions(+)

diff --git a/python/pyspark/sql/tests/test_udf_profiler.py 
b/python/pyspark/sql/tests/test_udf_profiler.py
index 4d565ecfd939..09ae6baaee57 100644
--- a/python/pyspark/sql/tests/test_udf_profiler.py
+++ b/python/pyspark/sql/tests/test_udf_profiler.py
@@ -537,6 +537,33 @@ class UDFProfiler2TestsMixin:
         for id in self.profile_results:
             self.assert_udf_profile_present(udf_id=id, 
expected_line_count_prefix=2)
 
+    @unittest.skipIf(
+        not have_pandas or not have_pyarrow,
+        cast(str, pandas_requirement_message or pyarrow_requirement_message),
+    )
+    def test_perf_profiler_pandas_udf_grouped_agg_iter(self):
+        import pandas as pd
+
+        @pandas_udf("double")
+        def pandas_mean_iter(it: Iterator[pd.Series]) -> float:
+            sum_val = 0.0
+            cnt = 0
+            for v in it:
+                sum_val += v.sum()
+                cnt += len(v)
+            return sum_val / cnt if cnt > 0 else 0.0
+
+        with self.sql_conf({"spark.sql.pyspark.udf.profiler": "perf"}):
+            df = self.spark.createDataFrame(
+                [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", 
"v")
+            )
+            df.groupBy(df.id).agg(pandas_mean_iter(df["v"])).show()
+
+        self.assertEqual(1, len(self.profile_results), 
str(self.profile_results.keys()))
+
+        for id in self.profile_results:
+            self.assert_udf_profile_present(udf_id=id, 
expected_line_count_prefix=2)
+
     @unittest.skipIf(
         not have_pandas or not have_pyarrow,
         cast(str, pandas_requirement_message or pyarrow_requirement_message),
diff --git a/python/pyspark/sql/udf.py b/python/pyspark/sql/udf.py
index 34c69d1ffda2..61f65d5243b7 100644
--- a/python/pyspark/sql/udf.py
+++ b/python/pyspark/sql/udf.py
@@ -439,6 +439,7 @@ class UserDefinedFunction:
                 PythonEvalType.SQL_MAP_PANDAS_ITER_UDF,
                 PythonEvalType.SQL_MAP_ARROW_ITER_UDF,
                 PythonEvalType.SQL_GROUPED_AGG_ARROW_ITER_UDF,
+                PythonEvalType.SQL_GROUPED_AGG_PANDAS_ITER_UDF,
             ]:
                 warnings.warn(
                     "Profiling UDFs with iterators input/output is not 
supported.",


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(spark) branch master updated: [SPARK-54738][PYTHON] Add Profiler Support for Pandas Grouped Iter Aggregate UDF

Reply via email to