d80tb7 commented on issue #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24965#issuecomment-506056080

@icexelloss - my understanding of @BryanCutler's idea is that he wants a completely separate Arrow stream for every group. In that case we would only have to hold 2 batches in memory at any one time, at the cost of paying the stream overhead (schema etc.) for every group. Assuming the stream overhead isn't significant (and I think it's reasonable to assume it won't be), this should work. I think the implementation might be a bit trickier (you'd have to send some sort of marker to indicate that all the Arrow streams have finished), but hopefully a PoC could help with this.
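
To make the "one stream per group plus an end marker" idea concrete, here is a rough sketch of the framing only. It is not Arrow and not the Spark implementation: JSON payloads stand in for Arrow schema and record batch messages, and the marker values and the names `MORE_GROUPS`, `END_OF_STREAMS`, `write_groups` and `read_groups` are all hypothetical, chosen just for illustration.

```python
import io
import json
import struct

# Hypothetical one-byte markers (assumptions, not part of Arrow or Spark).
MORE_GROUPS = 1     # another self-contained per-group stream follows
END_OF_STREAMS = 0  # all per-group streams have been sent


def _write_frame(out, payload):
    """Write a 4-byte big-endian length prefix followed by the payload."""
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)


def _read_exact(src, n):
    """Read exactly n bytes or raise if the source ends early."""
    data = src.read(n)
    if len(data) != n:
        raise EOFError("stream ended before the end-of-streams marker")
    return data


def _read_frame(src):
    """Read one length-prefixed payload."""
    (length,) = struct.unpack(">I", _read_exact(src, 4))
    return _read_exact(src, length)


def write_groups(out, groups):
    """Write each (schema, left, right) group as its own mini stream."""
    for schema, left, right in groups:
        out.write(struct.pack(">B", MORE_GROUPS))
        # The schema is resent for every group: this is the per-group overhead.
        _write_frame(out, json.dumps(schema).encode("utf-8"))
        _write_frame(out, json.dumps(left).encode("utf-8"))
        _write_frame(out, json.dumps(right).encode("utf-8"))
    out.write(struct.pack(">B", END_OF_STREAMS))


def read_groups(src):
    """Yield one group at a time, so only its two batches are held at once."""
    while True:
        (marker,) = struct.unpack(">B", _read_exact(src, 1))
        if marker == END_OF_STREAMS:
            return
        schema = json.loads(_read_frame(src))
        left = json.loads(_read_frame(src))
        right = json.loads(_read_frame(src))
        yield schema, left, right


if __name__ == "__main__":
    buf = io.BytesIO()
    schema = {"left": ["id", "v1"], "right": ["id", "v2"]}
    write_groups(buf, [
        (schema, [[1, 10]], [[1, 100], [1, 200]]),
        (schema, [[2, 20]], []),
    ])
    buf.seek(0)
    for group_schema, left, right in read_groups(buf):
        print(len(left), len(right))
```

The reader is a generator, so the consumer sees one group's left and right batches at a time, and the marker byte is what lets it tell "another group follows" apart from "all streams are finished". A real PoC would replace the JSON frames with actual Arrow streams.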
