[GitHub] [spark] icexelloss edited a comment on issue #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs

GitBox Wed, 26 Jun 2019 14:31:51 -0700

icexelloss edited a comment on issue #24965: [WIP][SPARK-27463][PYTHON] Support 
Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24965#issuecomment-506052685
 
 
   @BryanCutler I think the main issue with the approach that you suggested is 
that the Python worker needs to hold much more data. For example, assuming each 
Arrow stream has 10 batches, in order to process the first cogroup, the worker 
will need to read all 10 batches from the left table and the first batch from 
the right table. That is 11 batches in total instead of 2. I think that could 
mean significantly more memory usage in the Python worker.
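
   To make the arithmetic concrete, here is a rough sketch (not code from this 
PR) that counts how many batches the worker has to read before it can process 
the first cogroup. It assumes one batch per cogroup on each side, and the 
helper name `batches_read_before_first_cogroup` is made up for illustration.

```python
def batches_read_before_first_cogroup(order):
    """Count batches read until batch 0 of both sides has arrived.

    `order` is a list of (side, index) tuples in arrival order.
    Assumption: batch i of "left" and batch i of "right" form cogroup i.
    """
    seen = set()
    for count, (side, index) in enumerate(order, start=1):
        if index == 0:
            seen.add(side)
        if seen == {"left", "right"}:
            return count
    raise ValueError("first cogroup never completed")


n = 10  # batches per Arrow stream, as in the example above
left = [("left", i) for i in range(n)]
right = [("right", i) for i in range(n)]

# One stream sent after the other: every left batch precedes right[0].
sequential = left + right

# Alternating reads (e.g. over two connections): left[0], right[0], left[1], ...
alternating = [batch for pair in zip(left, right) for batch in pair]

sequential_count = batches_read_before_first_cogroup(sequential)
alternating_count = batches_read_before_first_cogroup(alternating)
print(sequential_count, alternating_count)
```

   With back-to-back streams the count grows with the number of left batches, 
while with alternating reads it stays at 2 regardless of stream length.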
   
   If we want to send two Arrow streams, I think we would need to do it with two 
separate connections between Python and Java so the Python worker can alternate 
between the two streams. I think that could be more complicated, but I'm not 
entirely sure. Is this what you prefer?



