Yicong-Huang commented on code in PR #53043:
URL: https://github.com/apache/spark/pull/53043#discussion_r2562240476
##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -1315,22 +1335,25 @@ def process_group(batches: "Iterator[pa.RecordBatch]"):
             )
     def dump_stream(self, iterator, stream):
-        """
-        Flatten the (dataframes_generator, arrow_type) tuples by iterating over each generator.
-        This allows the iterator UDF to stream results without materializing all DataFrames.
-        """
-        # Flatten: (dataframes_generator, arrow_type) -> (df, arrow_type), (df, arrow_type), ...
-        flattened_iter = (
-            (df, arrow_type) for dataframes_gen, arrow_type in iterator for df in dataframes_gen
-        )
-
-        # Convert each (df, arrow_type) to the format expected by parent's dump_stream
-        series_iter = ([(df, arrow_type)] for df, arrow_type in flattened_iter)
+        # Flatten iterator of (generator, arrow_type) into (df, arrow_type) for parent class
+        def flatten_iterator():
+            for (
+                batches_gen,
+                arrow_type,
+            ) in iterator:  # tuple constructed in wrap_grouped_*_pandas_udf
+                # yields df for single UDF or [(df1, type1), (df2, type2), ...] for multiple UDFs
Review Comment:
Ok, I mistakenly assumed this needed to support multiple UDFs, which made
the implementation more complex than necessary. I have removed that
assumption and simplified the code.
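
For readers following along, here is a minimal standalone sketch of the
flattening this `dump_stream` override performs, assuming the simplified
single-UDF-per-group case described above. The name
`flatten_for_dump_stream` and the toy driver are hypothetical, purely for
illustration; the real code consumes tuples built by
`wrap_grouped_*_pandas_udf` and hands the flattened items to the parent
serializer:

```python
import pandas as pd
import pyarrow as pa

def flatten_for_dump_stream(iterator):
    # Each input item pairs a lazy generator of per-group DataFrames with
    # the Arrow type of the UDF result. Yielding one [(df, arrow_type)]
    # item at a time lets the parent dump_stream serialize each group's
    # result as soon as the UDF produces it, without materializing all
    # DataFrames up front.
    for dataframes_gen, arrow_type in iterator:
        for df in dataframes_gen:
            yield [(df, arrow_type)]

# Toy driver (hypothetical): one "group" whose UDF lazily yields two DataFrames.
def group_results():
    yield pd.DataFrame({"x": [1, 2]})
    yield pd.DataFrame({"x": [3]})

for item in flatten_for_dump_stream([(group_results(), pa.int64())]):
    print(item)  # one [(df, arrow_type)] pair per yielded DataFrame
```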
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]