Yicong-Huang commented on code in PR #53043:
URL: https://github.com/apache/spark/pull/53043#discussion_r2562240476
##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -1315,22 +1335,25 @@ def process_group(batches: "Iterator[pa.RecordBatch]"):
             )
     def dump_stream(self, iterator, stream):
-        """
-        Flatten the (dataframes_generator, arrow_type) tuples by iterating over each generator.
-        This allows the iterator UDF to stream results without materializing all DataFrames.
-        """
-        # Flatten: (dataframes_generator, arrow_type) -> (df, arrow_type), (df, arrow_type), ...
-        flattened_iter = (
-            (df, arrow_type) for dataframes_gen, arrow_type in iterator for df in dataframes_gen
-        )
-
-        # Convert each (df, arrow_type) to the format expected by parent's dump_stream
-        series_iter = ([(df, arrow_type)] for df, arrow_type in flattened_iter)
+        # Flatten iterator of (generator, arrow_type) into (df, arrow_type) for parent class
+        def flatten_iterator():
+            for (
+                batches_gen,
+                arrow_type,
+            ) in iterator:  # tuple constructed in wrap_grouped_*_pandas_udf
+                # yields df for single UDF or [(df1, type1), (df2, type2), ...] for multiple UDFs
Review Comment:
Ok, I mistakenly assumed this needed to support multiple UDFs, which made
the implementation more complex than necessary. I have removed that
assumption and simplified the code.
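
For readers following along, here is a minimal standalone sketch of the
flattening this `dump_stream` override performs, assuming the simplified
single-UDF-per-group case described above. The name
`flatten_for_dump_stream` and the toy driver are hypothetical, purely for
illustration; the real code consumes tuples built by
`wrap_grouped_*_pandas_udf` and hands the flattened items to the parent
serializer:

```python
import pandas as pd
import pyarrow as pa

def flatten_for_dump_stream(iterator):
    # Each input item pairs a lazy generator of per-group DataFrames with
    # the Arrow type of the UDF result. Yielding one [(df, arrow_type)]
    # item at a time lets the parent dump_stream serialize each group's
    # result as soon as the UDF produces it, without materializing all
    # DataFrames up front.
    for dataframes_gen, arrow_type in iterator:
        for df in dataframes_gen:
            yield [(df, arrow_type)]

# Toy driver (hypothetical): one "group" whose UDF lazily yields two DataFrames.
def group_results():
    yield pd.DataFrame({"x": [1, 2]})
    yield pd.DataFrame({"x": [3]})

for item in flatten_for_dump_stream([(group_results(), pa.int64())]):
    print(item)  # one [(df, arrow_type)] pair per yielded DataFrame
```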
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]