icexelloss commented on issue #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24965#issuecomment-506052685

@BryanCutler I think the main issue with the approach you suggested is that the Python worker needs to hold much more data. For example, assuming each Arrow stream has 10 batches, in order to process the first cogroup, the worker will need to read all 10 batches from the left table and the first batch from the right table. That could mean significantly more memory usage in the Python worker.

If we want to send two Arrow streams, I think we would need to do it over two separate connections between Python and Java so the Python worker can alternate between the two streams. I think that could be more complicated, but I'm not entirely sure. Is this what you prefer?
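
To make the memory argument concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not Spark or Arrow code. The function names are hypothetical, and it assumes (as a simplification) that cogroup k is formed from batch k of the left stream and batch k of the right stream, and that both batches are released as soon as the cogroup is processed.

```python
def peak_batches_sequential(n_left, n_right):
    """Peak number of batches held when the left stream is sent in full
    before the right stream on a single connection (simplified model)."""
    held = 0
    peak = 0
    # The worker must drain the whole left stream before it sees any right batch.
    for _ in range(n_left):
        held += 1
        peak = max(peak, held)
    # Each arriving right batch completes one cogroup, which frees two batches.
    for _ in range(min(n_left, n_right)):
        held += 1
        peak = max(peak, held)
        held -= 2
    return peak


def peak_batches_interleaved(n_left, n_right):
    """Peak number of batches held when the worker can alternate between
    two streams, e.g. over two separate connections (simplified model)."""
    held = 0
    peak = 0
    for _ in range(min(n_left, n_right)):
        held += 2  # one left batch and one right batch
        peak = max(peak, held)
        held -= 2  # cogroup processed, both batches released
    return peak


if __name__ == "__main__":
    print(peak_batches_sequential(10, 10))   # 11
    print(peak_batches_interleaved(10, 10))  # 2
```

Under these assumptions, the sequential layout peaks at 11 buffered batches for the 10-batch example, versus 2 when the worker can alternate between the streams. Real batch sizes and group boundaries would change the exact numbers.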
