Kimahriman commented on code in PR #49005:
URL: https://github.com/apache/spark/pull/49005#discussion_r1867611465


##########
python/pyspark/sql/pandas/serializers.py:
##########
@@ -728,8 +743,9 @@ def load_stream(self, stream):
             dataframes_in_group = read_int(stream)
 
             if dataframes_in_group == 2:
+                # We need to fully load the left batches, but we can lazily load the right batches

Review Comment:
   The JVM side for grouped map was adapted from and consolidated with the 
cogroup code to use an Arrow stream per group. That has the benefit of at 
least not collecting an entire group on the JVM side before serializing it, 
so I think that should stay. I can undo the Python side changes and leave 
only the Table -> Table API for cogroup if that's desired. There are at 
least _some_ improvements to this implementation of cogroup, especially 
being able to return an iterator of RecordBatches.

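   And a rough usage-level sketch of a cogroup function under a Table ->
   Table style API that yields RecordBatches incrementally (the signature and
   the shared "id" key column are assumptions, not necessarily the exact API
   in this PR):

   ```python
   import pyarrow as pa
   from typing import Iterator

   def cogroup_func(left: pa.Table, right: pa.Table) -> Iterator[pa.RecordBatch]:
       # Sketch only: emit the result incrementally instead of building one
       # large output Table for the whole cogrouped result.
       joined = left.join(right, keys="id")  # assumes both sides have an "id" column
       for batch in joined.to_batches(max_chunksize=10_000):
           yield batch
   ```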



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

