Re: [PR] [SPARK-53615][PYTHON] Introduce iterator API for Arrow grouped aggregation UDF [spark]

via GitHub Tue, 25 Nov 2025 13:40:02 -0800


Kimahriman commented on code in PR #53035:
URL: https://github.com/apache/spark/pull/53035#discussion_r2561531946



##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -1222,53 +1222,31 @@ def __init__(
 
     def load_stream(self, stream):
         """
-        Yield column iterators instead of concatenating batches.
-        Each group yields a structure where indexing by column offset gives an iterator of arrays.
+        Yield an iterator that produces one list of column arrays per batch.
+        Each group yields Iterator[List[pa.Array]], allowing UDF to process batches one by one
+        without consuming all batches upfront.
         """
+
+        def process_group(batches: "Iterator[pa.RecordBatch]"):
+            # Convert each Arrow batch to a list of column arrays on-demand, yielding one list per batch
+            for batch in batches:
+                # Extract all columns from the batch as a list of arrays
+                column_arrays = [batch.column(col_idx) for col_idx in range(batch.num_columns)]
+                yield column_arrays

Review Comment:
   Is this effectively just
   ```suggestion
                   yield batch.columns
   ```
   ?
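
   For reference, a minimal sketch of the equivalence this suggestion relies on,
   assuming pyarrow is installed. The sample batch and the helper names
   (`columns_via_index`, `process_group`) are illustrative and are not taken
   from the PR; only `RecordBatch.column`, `RecordBatch.num_columns`, and
   `RecordBatch.columns` are real pyarrow API.

   ```python
   import pyarrow as pa


   def columns_via_index(batch):
       # The explicit form from the diff: fetch each column by position.
       return [batch.column(col_idx) for col_idx in range(batch.num_columns)]


   def process_group(batches):
       # The suggested form: `columns` already returns the list of arrays.
       for batch in batches:
           yield batch.columns


   # Hypothetical sample data, only for the comparison below.
   batch = pa.RecordBatch.from_pydict({"id": [1, 2, 3], "v": [0.5, 1.5, 2.5]})

   explicit = columns_via_index(batch)
   direct = next(process_group(iter([batch])))

   # Both forms give a Python list with one pa.Array per column, in order.
   same = len(explicit) == len(direct) and all(
       a.equals(b) for a, b in zip(explicit, direct)
   )
   ```

   If `same` holds for the batches the serializer sees, the loop body can
   shrink to the single `yield`.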



