d80tb7 commented on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl URL: https://github.com/apache/spark/pull/24981#issuecomment-513541503 @icexelloss- I do like the api you propose- for the common case of cogoup two dataframes on a common comon it's by far the most natural of everything that's been discussed. Once concern I do have is with the ability to extend it to support a cogroup containing multiple dataframes (we certainly have use cases where we would like to cogroup 3 or 4 dataframes together and I would like to add this as a follow up). Clearly, if there is a common cogroup column then this should be possible i.e: `df1.cogroup(df2, df3, 'id')` but it seems more tricky to extend this to support an arbitrary cogroup expression. I.e. the equivalent of (the admittedly verbose): ``` gdf1 = df1.groupby('id1') gdf2 = df2.groupby('id2') gdf3 = df3groupby('id3') gdf1.cogroup(gdf2, gdf2) ``` One option here could be to specifically disallow cogrouping by anything but a common column. In isolation, I'd be fine with this as a) the udf code generally becomes much simpler if the common key column is actually named the same in all dfs b) reasoning about a cogroup becomes somewhat difficult if you have arbitrarily complex cogroup expresssions and c) you can always use spark to make a common column using whatever expressions you see fit. I think the disadvantage with this approach is that it would break api consitency- i.e. groupby().apply() lets you group by an arbitrary expresion but cogroup().apply() doesn't. A further option could be to make this look more like the join api- so: ``` df1.cogroup(df2, df1['id'] == df2['id'2]).cogroup(df3, df1['id'] == df3['id3']) ``` An added benefit of this would be that it would allow us to support rigght join taps (e.g. right outer) in the case of multiple dataframes (which I don't think we could in the other cases), the disadvantge would be that the implementation here might be very different to what we currently have and not necessarily starightforward (I'd have to check). thoughts?
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
