icexelloss commented on issue #24981: [WIP][SPARK-27463][PYTHON] Support 
Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-513961159
 
 
   @d80tb7 Very interesting thoughts! Thank you.
   
   Now as I think more, I feel that cogroup operation has two parts: groupby 
and join. When I looked at sth like:
   
   ```
   df1.cogroup(df2, df1['id'] == df2['id2']).cogroup(df3, df1['id'] == 
df3['id3'])
   ```
   
   We are saying "group df1 by id, df2 by id2, and df3 by id3, then join them 
by id == id2 == id3". Here the expressions `df1['id'] == df2['id2']` and 
`df1['id'] == df3['id3']` have to express both the groupby and the join, which 
is a little confusing. If we write `df1['id'] == df2['id2'] / 2`, are we saying 
"group df1 by id, group df2 by id2, and join them with id == id2 / 2", or are 
we saying "group df1 by id, group df2 by id2 / 2, and join them with 
id == id2 / 2"? I think this ambiguity makes a single expression confusing for 
cogroup. For join it makes more sense: because join has no notion of groupby, 
one expression for the join condition is unambiguous.
   
   I think I now prefer the explicit groupby, simply because I like breaking 
this operation into two steps; it is two steps in nature.
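   To make the two-step shape concrete, here is a rough emulation of the 
cogroup-then-apply semantics in plain pandas (the function `cogroup_apply` and 
its signature are purely illustrative, not the proposed PySpark API):

```python
import pandas as pd

def cogroup_apply(df1, df2, key, func):
    # Step 1: group each frame by its key (the same column name on
    # both sides here, for simplicity).
    groups1 = dict(tuple(df1.groupby(key)))
    groups2 = dict(tuple(df2.groupby(key)))
    # Step 2: "join" the groups by key equality and apply func to each
    # pair of groups; a missing side is passed as an empty frame.
    out = []
    for k in sorted(set(groups1) | set(groups2)):
        left = groups1.get(k, df1.iloc[0:0])
        right = groups2.get(k, df2.iloc[0:0])
        out.append(func(left, right))
    return pd.concat(out, ignore_index=True)

df1 = pd.DataFrame({'id': [1, 1, 2], 'v': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'id': [1, 2, 2], 'w': [10.0, 20.0, 30.0]})

def summarize(left, right):
    return pd.DataFrame({'v_sum': [left['v'].sum()],
                         'w_sum': [right['w'].sum()]})

result = cogroup_apply(df1, df2, 'id', summarize)
```

   Separating the groupby from the pairing makes each step's key explicit, 
which is exactly what a single join-style expression muddles.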
   
   To go a step further, I wonder if the API could express "group df1 by date, 
group df2 by date, and join each group in df1 with the groups in df2 that are 
within the past 3 days". I'd imagine something like:
   
   ```
   gdf1 = df1.groupby('date')
   gdf2 = df2.groupby('date')
   
   gdf1.cogroup(gdf2, gdf2['date'] - '3 days' <= gdf1['date'] <= gdf2['date'])
   ```
   
   This would probably be pretty hard to do otherwise. I'm not saying we should 
support this type of cogroup, but I feel that having separate steps for group 
and join gives us a more expressive API.
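   For what it's worth, the windowed pairing can also be sketched in plain 
pandas (again, `cogroup_window` and its signature are hypothetical, just to 
illustrate the semantics of the condition in the example above):

```python
from datetime import date, timedelta

import pandas as pd

def cogroup_window(df1, df2, key, lookback, func):
    # Group both frames by the key column.
    groups1 = dict(tuple(df1.groupby(key)))
    groups2 = dict(tuple(df2.groupby(key)))
    out = []
    # Pair each df1 group with every df2 group whose key satisfies
    # k2 - lookback <= k1 <= k2, i.e. the condition from the example.
    for k1 in sorted(groups1):
        matched = [g for k2, g in groups2.items()
                   if k2 - lookback <= k1 <= k2]
        right = pd.concat(matched) if matched else df2.iloc[0:0]
        out.append(func(groups1[k1], right))
    return pd.concat(out, ignore_index=True)

df1 = pd.DataFrame({'date': [date(2019, 7, 1)], 'x': [1.0]})
df2 = pd.DataFrame({'date': [date(2019, 7, 1), date(2019, 7, 2),
                             date(2019, 7, 4)],
                    'y': [1.0, 2.0, 4.0]})

result = cogroup_window(
    df1, df2, 'date', timedelta(days=3),
    lambda left, right: pd.DataFrame({'date': left['date'].iloc[:1],
                                      'y_sum': [right['y'].sum()]}))
```

   The point is only that once group and join are separate steps, a non-equi 
pairing of groups like this is straightforward to state.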
