d80tb7 commented on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-510663626

Thanks @BryanCutler for the review, very helpful!

As this started off as a PoC, I think there's quite a lot going on here (and the branch name is now a bit misleading). Do you have any objections if I move this into a more appropriately named branch and raise a new PR? This will also give me a chance to tidy up some of the Scala code, which certainly needs it.

I've been following the discussions regarding changes to the pandas_udfs and I concur that it should (hopefully) only require the pandas_udf declaration to change.

One thing that I would like to decide before going too much further is the public API. We discussed this on the Jira and I don't think we reached a consensus. I would like to proceed based on what we have in this pull request, i.e. `df1.groupby('id').cogroup(df2.groupby('id')).apply(func)`, but if people feel strongly about the alternatives we suggested then I am open to change. My reasoning here is that we can change things like using or not using Arrow stream in follow-up work, but the public API will be pretty much fixed once this is merged in.

@HyukjinKwon @icexelloss do you agree?
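
For readers unfamiliar with cogroup, here is a minimal plain-Python sketch of the semantics behind the proposed call shape. It is not the code in this PR and does not use Spark, pandas, or Arrow. The helper names `group_by`, `cogroup_apply`, and `count_rows` are hypothetical, rows are modelled as dicts, and it assumes `func` is called once per key that appears on either side, receiving both groups.

```python
from collections import defaultdict


def group_by(rows, key):
    """Group a list of dict rows by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def cogroup_apply(left_rows, right_rows, key, func):
    """Call func(key_value, left_group, right_group) once per key.

    Assumption: a key present on only one side still triggers a call,
    with an empty list standing in for the missing group.
    """
    left = group_by(left_rows, key)
    right = group_by(right_rows, key)
    out = []
    for k in sorted(set(left) | set(right)):
        out.extend(func(k, left.get(k, []), right.get(k, [])))
    return out


def count_rows(k, left_group, right_group):
    """Example user function: report how many rows each side has for the key."""
    return [{"id": k, "n_left": len(left_group), "n_right": len(right_group)}]


# Stand-ins for df1 and df2 in the proposed API.
df1 = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 2, "v": 20}]
df2 = [{"id": 1, "w": 0.5}, {"id": 3, "w": 0.7}]

result = cogroup_apply(df1, df2, "id", count_rows)
print(result)
```

In the proposed API the two groups would arrive as pandas DataFrames rather than lists of dicts, and the grouping would be done by Spark across partitions; the sketch only shows what the user function sees for each key.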
