icexelloss commented on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-512462884

@d80tb7 My personal preference is to keep this PR, because it already has quite a bit of discussion and comments that would be lost in a new PR. But it's not a big deal.

In terms of public API, I think we would need the CogroupedData class either way, so to me it is just a matter of whether we want to provide an API wrapper on DataFrame:

```
def cogroup(self, other, on):
    return self.groupby(on).cogroup(other.groupby(on))
```

I think this is a small code change that we can deal with later without disrupting the bulk of the change.

For now, my two cents is that we get this PR into a mergeable state, with decisions pending on:
(1) `df.cogroup(df2, on='id')` or `df.groupby('id').cogroup(df2.groupby('id'))`
(2) the pandas_udf decorator change w.r.t. SPARK-28264

@d80tb7 @HyukjinKwon @BryanCutler what do you guys think?
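
To make the two spellings in (1) concrete, here is a minimal sketch using plain-Python stand-ins rather than the real PySpark classes. `Frame`, `GroupedData`, the `apply` signature, and `count_rows` are all hypothetical names made up for illustration; only the body of `Frame.cogroup` mirrors the wrapper quoted above. It is meant to show that the wrapper is pure delegation, not how the PR implements cogrouping.

```python
class CogroupedData:
    """Hypothetical stand-in: holds two grouped datasets that share a key."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def apply(self, func):
        # Call func(key, left_rows, right_rows) once per key seen on either side.
        keys = sorted(set(self.left.groups) | set(self.right.groups))
        return [
            func(k, self.left.groups.get(k, []), self.right.groups.get(k, []))
            for k in keys
        ]


class GroupedData:
    """Hypothetical stand-in for the result of groupby."""

    def __init__(self, rows, on):
        self.groups = {}
        for row in rows:
            self.groups.setdefault(row[on], []).append(row)

    def cogroup(self, other):
        return CogroupedData(self, other)


class Frame:
    """Hypothetical stand-in for DataFrame, backed by a list of dicts."""

    def __init__(self, rows):
        self.rows = rows

    def groupby(self, on):
        return GroupedData(self.rows, on)

    def cogroup(self, other, on):
        # The wrapper under discussion: sugar over the groupby form.
        return self.groupby(on).cogroup(other.groupby(on))


def count_rows(key, left_rows, right_rows):
    return (key, len(left_rows), len(right_rows))


df = Frame([{"id": 1, "v": 10}, {"id": 2, "v": 20}])
df2 = Frame([{"id": 1, "w": 1.5}, {"id": 3, "w": 3.5}])

via_wrapper = df.cogroup(df2, on="id").apply(count_rows)
via_groupby = df.groupby("id").cogroup(df2.groupby("id")).apply(count_rows)
```

In this sketch both spellings go through the same `CogroupedData` object, which is why the wrapper could be added or dropped later without touching the rest of the change.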
