Kimahriman commented on code in PR #49005:
URL: https://github.com/apache/spark/pull/49005#discussion_r1867611465
##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -728,8 +743,9 @@ def load_stream(self, stream):
dataframes_in_group = read_int(stream)
if dataframes_in_group == 2:
+ # We need to fully load the left batches, but we can lazily load the right batches
Review Comment:
The JVM side for grouped map was adapted from and consolidated with the
cogroup code to use an Arrow stream per group. That has the benefit of at
least not collecting an entire group on the JVM side before serializing it, so
I think that should stay. I can undo the Python side changes and leave only the
Table -> Table API for cogroup if that's desired. There are at least _some_
improvements to this implementation of cogroup, especially being able to
return an iterator of RecordBatches.
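To illustrate the loading strategy the diff comment describes (left side materialized eagerly, right side consumed lazily from the stream), here is a minimal sketch. The function name `load_cogroup` is hypothetical, and plain Python lists stand in for Arrow RecordBatches; it is not the actual serializer code in this PR:

```python
def load_cogroup(left_batches, right_batches):
    """Hypothetical sketch of the cogroup loading strategy:
    the left group's batches must be fully loaded before the UDF
    runs, but the right group's batches can be pulled on demand."""
    left = list(left_batches)    # fully loaded up front
    right = iter(right_batches)  # lazy: each batch read only when requested
    return left, right

# Lists stand in for RecordBatches read from the Arrow stream.
left, right = load_cogroup([[1, 2], [3]], [[4], [5, 6]])
# `left` is complete immediately; `right` yields batches one at a time.
```

The lazy right-hand iterator means the second group's batches never need to be held in memory all at once unless the UDF actually consumes them.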
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]