icexelloss edited a comment on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-513961159

@d80tb7 Very interesting thoughts! Thank you.

Now that I think about it more, I feel that the cogroup operation has two parts: groupby and join. When I look at something like:

```
df1.cogroup(df2, df1['id'] == df2['id2']).cogroup(df3, df1['id'] == df3['id3'])
```

we are saying "group df1 by id, df2 by id2, df3 by id3, and then join them by id == id2 == id3". Here we use `df1['id'] == df2['id2']` and `df1['id'] == df3['id3']` to express both the groupby and the join, which is a little confusing.

If we write `df1['id'] == df2['id2'] / 2`, are we saying "group df1 by id, group df2 by id2, and join them with id == id2 / 2", or are we saying "group df1 by id, group df2 by id2 / 2, and join them with id == id2 / 2"? I think that would be a little confusing in cogroup. In join, because there is no notion of groupby, having one expression for the join condition makes more sense.

I think I now prefer an explicit groupby, because I like breaking this operation down into two steps; I do feel it is two steps in nature.

To go a step further, I wonder if the API can express "group df1 by date, df2 by date, and join each group in df1 with the groups in df2 that are within the past 3 days". I'd imagine something like:

```
gdf1 = df1.groupby('date')
gdf2 = df2.groupby('date')
gdf1.cogroup(gdf2, gdf2['date'] - "3 days" <= gdf1["date"] <= gdf2["date"])
```

which is probably pretty hard to do otherwise. I'm not saying we should support this type of cogroup, but I feel that having separate steps for group and join gives us a more expressive API and fits the nature of the operation better.
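
To make the two-step reading concrete, here is a minimal pure-Python sketch of "group first, then join the groups". It is not Spark code and not the API proposed in this PR: `group_by`, `cogroup`, and `within_past_3_days` are hypothetical helpers, rows are plain dicts, and "within the past 3 days" is assumed to mean right-side dates from three days before the left-side date up to and including it.

```python
from collections import defaultdict
from datetime import date, timedelta


def group_by(rows, key):
    """Step 1: bucket rows (plain dicts) by the value of one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def cogroup(left_groups, right_groups, condition):
    """Step 2: pair each left group with the rows of every right group
    whose key satisfies condition(left_key, right_key)."""
    result = []
    for lkey, lrows in sorted(left_groups.items()):
        matched = [
            row
            for rkey, rrows in sorted(right_groups.items())
            if condition(lkey, rkey)
            for row in rrows
        ]
        result.append((lkey, lrows, matched))
    return result


def within_past_3_days(left_date, right_date):
    # Assumed reading: the right key lies in [left_date - 3 days, left_date].
    return left_date - timedelta(days=3) <= right_date <= left_date


# Illustrative data only.
df1 = [
    {"date": date(2019, 7, 4), "v": 1},
    {"date": date(2019, 7, 4), "v": 2},
    {"date": date(2019, 7, 10), "v": 3},
]
df2 = [
    {"date": date(2019, 6, 30), "w": 40},
    {"date": date(2019, 7, 1), "w": 10},
    {"date": date(2019, 7, 3), "w": 20},
    {"date": date(2019, 7, 9), "w": 30},
]

pairs = cogroup(group_by(df1, "date"), group_by(df2, "date"), within_past_3_days)
for key, left_rows, right_rows in pairs:
    print(key, [r["v"] for r in left_rows], [r["w"] for r in right_rows])
```

Because the grouping keys are fixed in step 1, the condition in step 2 only ever compares group keys, so there is no ambiguity about whether an expression such as `id2 / 2` belongs to the grouping or to the join.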
