d80tb7 opened a new pull request #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24965

This is a rough first cut of a Pandas UDF cogroup implementation. Currently implemented:
- JVM serialisation for interleaved dataframes.
- Python deserialisation for interleaved dataframes.
- A skeleton cogroup implementation.

The code is still pretty rough, with the main caveats being:
- The data passing is pretty minimal (e.g. it only supports exactly two dataframes, and there's no ability to distinguish on the Python side between key and value columns).
- The cogroup implementation doesn't work properly when grouping by a string, as attribute resolution fails.

At this point I think I'd like to focus on:
- Does the data passing mechanism (i.e. the deviation from Arrow streaming) make sense?
- If we are going to introduce such a data passing mechanism, how complex should it be?
- Does the high-level implementation of the cogroup here make sense?
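
To make the discussion of "interleaved" data passing concrete, here is a minimal pure-Python sketch of the idea. It is not the code in this PR and uses neither Spark nor Arrow: rows are plain dicts, and the names `group_rows`, `interleave` and `cogroup_apply` are hypothetical helpers made up for illustration. The assumption is that the producer emits, for each key, the left group immediately followed by the right group, and the consumer reads the stream back two groups at a time.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def group_rows(rows: List[Row], key: str) -> Dict[Any, List[Row]]:
    """Bucket rows by the value of their key column."""
    groups: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def interleave(left: List[Row], right: List[Row], key: str) -> Iterator[List[Row]]:
    """Producer side (hypothetical): for every key present in either input,
    emit the left group and then the right group. A missing side is sent
    as an empty group so the stream always alternates left, right."""
    left_groups = group_rows(left, key)
    right_groups = group_rows(right, key)
    for k in sorted(set(left_groups) | set(right_groups)):
        yield left_groups.get(k, [])
        yield right_groups.get(k, [])


def cogroup_apply(
    stream: Iterable[List[Row]],
    func: Callable[[List[Row], List[Row]], List[Row]],
) -> List[Row]:
    """Consumer side (hypothetical): read the stream two groups at a time
    and hand each (left, right) pair to the user function. Like the
    caveat above, this handles exactly two inputs and does not tell the
    function which column is the key."""
    out: List[Row] = []
    it = iter(stream)
    for left_group in it:
        right_group = next(it)  # the producer always sends groups in pairs
        out.extend(func(left_group, right_group))
    return out


def count_sides(left_group: List[Row], right_group: List[Row]) -> List[Row]:
    """Example user function: report how many rows each side contributed."""
    return [{"n_left": len(left_group), "n_right": len(right_group)}]


left = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
right = [{"id": 1, "w": 1.5}, {"id": 3, "w": 3.5}]
result = cogroup_apply(interleave(left, right, "id"), count_sides)
```

With these inputs the user function is called once per key (1, 2 and 3), including keys that appear on only one side. A richer protocol would need to carry the key and the number of inputs alongside each group, which is the complexity trade-off raised in the questions above.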
