[GitHub] [spark] d80tb7 commented on issue #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs

GitBox Wed, 26 Jun 2019 14:41:55 -0700

d80tb7 commented on issue #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe 
Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24965#issuecomment-506056080
 
 
   @icexelloss - my understanding of @BryanCutler's idea is that he wants a 
completely separate Arrow stream for every group.  In this case we would only 
have to hold 2 batches in memory at any one time, albeit at the cost of paying 
the stream overhead (schema etc.) for every group.
   
   Assuming that the stream overhead isn't significant (and I think it's 
reasonable to assume it won't be), this should work.  I think the 
implementation might be a bit more tricky (you'd have to send some sort of 
marker to indicate that all the Arrow streams have finished), but hopefully a 
PoC could help with this.
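
   To make the framing concrete, here is a rough sketch of the idea rather 
than the actual Spark or Arrow code.  It assumes each "stream" is just a schema 
frame followed by one batch frame, uses length-prefixed JSON as a stand-in for 
the Arrow IPC format, and assumes the 2 batches are one per side of the 
cogroup.  The marker values `NEXT_GROUP` and `END_OF_STREAMS` and all function 
names are hypothetical.

```python
import io
import json
import struct

NEXT_GROUP = 1        # hypothetical marker: another cogroup follows
END_OF_STREAMS = -1   # hypothetical marker: all streams have finished


def _write_frame(out, obj):
    # Length-prefixed JSON frame; a stand-in for an Arrow IPC message.
    payload = json.dumps(obj).encode("utf-8")
    out.write(struct.pack(">i", len(payload)))
    out.write(payload)


def _read_frame(inp):
    (size,) = struct.unpack(">i", inp.read(4))
    return json.loads(inp.read(size).decode("utf-8"))


def write_cogroups(out, cogroups, left_schema, right_schema):
    """Write every cogroup as two independent streams (schema + one batch)."""
    for left_rows, right_rows in cogroups:
        out.write(struct.pack(">i", NEXT_GROUP))
        for schema, rows in ((left_schema, left_rows), (right_schema, right_rows)):
            _write_frame(out, schema)  # stream overhead, paid once per group
            _write_frame(out, rows)    # the single batch for this group
    # Tell the reader that no further streams will arrive.
    out.write(struct.pack(">i", END_OF_STREAMS))


def read_cogroups(inp):
    """Yield one cogroup at a time, so only two batches are held at once."""
    while True:
        (marker,) = struct.unpack(">i", inp.read(4))
        if marker == END_OF_STREAMS:
            return
        left_schema, left_rows = _read_frame(inp), _read_frame(inp)
        right_schema, right_rows = _read_frame(inp), _read_frame(inp)
        yield (left_schema, left_rows), (right_schema, right_rows)


# Round trip through an in-memory buffer standing in for the JVM/Python socket.
buf = io.BytesIO()
groups = [([[1, "a"]], [[1, 2.0]]), ([[2, "b"], [2, "c"]], [])]
write_cogroups(buf, groups, ["id", "name"], ["id", "score"])
buf.seek(0)
result = list(read_cogroups(buf))
```

   The schema is rewritten for every group, which is the per-group overhead 
mentioned above, and the reader only learns that it is done when it sees the 
end marker instead of another group.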

