d80tb7 commented on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe 
Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-513541503
 
 
   @icexelloss- I do like the API you propose- for the common case of cogrouping 
two dataframes on a common column it's by far the most natural of everything 
that's been discussed.  One concern I do have is with the ability to extend it 
to support a cogroup containing multiple dataframes (we certainly have use 
cases where we would like to cogroup 3 or 4 dataframes together and I would 
like to add this as a follow up).  Clearly, if there is a common cogroup column 
then this should be possible, i.e.:
   
   `df1.cogroup(df2, df3, 'id')`
   
   but it seems more tricky to extend this to support an arbitrary cogroup 
expression. I.e. the equivalent of (the admittedly verbose):
   
   ```
   gdf1 = df1.groupby('id1')
   gdf2 = df2.groupby('id2')
   gdf3 = df3.groupby('id3')
   gdf1.cogroup(gdf2, gdf3)
   ```
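
To make the intended semantics concrete, here is a rough pure-Python sketch (no Spark involved) of cogrouping three inputs where each input has its own key expression. The `cogroup` helper and its `(rows, key_fn)` argument shape are hypothetical and only for illustration; they are not the API in this PR.

```python
from collections import defaultdict


def cogroup(*inputs):
    """Cogroup any number of (rows, key_fn) pairs (hypothetical helper).

    Returns {key: [rows_from_input_0, rows_from_input_1, ...]}. An input
    that has no rows for a given key contributes an empty list.
    """
    grouped = defaultdict(lambda: [[] for _ in inputs])
    for i, (rows, key_fn) in enumerate(inputs):
        for row in rows:
            grouped[key_fn(row)][i].append(row)
    return dict(grouped)


# Three toy "dataframes", each with a differently named key column.
rows1 = [{"id1": 1, "v": "a"}, {"id1": 2, "v": "b"}]
rows2 = [{"id2": 1, "w": 10}]
rows3 = [{"id3": 2, "x": 0.5}, {"id3": 3, "x": 1.5}]

result = cogroup(
    (rows1, lambda r: r["id1"]),
    (rows2, lambda r: r["id2"]),
    (rows3, lambda r: r["id3"]),
)
```

Each key maps to one list of rows per input, which is roughly what a three-way cogroup UDF would receive as its three pandas DataFrames.
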
   One option here could be to specifically disallow cogrouping by anything but 
a common column. In isolation, I'd be fine with this as a) the udf code 
generally becomes much simpler if the common key column is actually named the 
same in all dfs, b) reasoning about a cogroup becomes somewhat difficult if you 
have arbitrarily complex cogroup expressions and c) you can always use Spark 
to make a common column using whatever expressions you see fit.  I think the 
disadvantage with this approach is that it would break API consistency- i.e. 
groupby().apply() lets you group by an arbitrary expression but 
cogroup().apply() doesn't.
   
   A further option could be to make this look more like the join api- so:
   
   ```
   df1.cogroup(df2, df1['id'] == df2['id2']).cogroup(df3, df1['id'] == df3['id3'])
   ```
   
   An added benefit of this would be that it would allow us to support different 
join types (e.g. right outer) in the case of multiple dataframes (which I don't 
think we could in the other cases). The disadvantage would be that the 
implementation here might be very different to what we currently have and not 
necessarily straightforward (I'd have to check).
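
As a rough illustration of why the chained form makes per-step join types possible, the sketch below tracks only which keys would survive each chained step. The `chain_keys` function, the `how` names, and the key-set semantics are my assumptions for illustration, not something the current implementation does.

```python
def chain_keys(first_keys, steps):
    """Fold a chain of (other_keys, how) steps over a starting key set.

    Hypothetical semantics: each chained cogroup decides which keys are
    kept, in the same way a join type decides which rows are kept.
    """
    keys = set(first_keys)
    for other_keys, how in steps:
        other = set(other_keys)
        if how == "inner":
            keys &= other   # only keys present on both sides
        elif how == "outer":
            keys |= other   # keys present on either side
        elif how == "left":
            pass            # keep the keys accumulated so far
        elif how == "right":
            keys = other    # keep only the newly added side's keys
        else:
            raise ValueError(f"unknown join type: {how}")
    return keys


# df1 has keys {1, 2}; cogroup df2 ({1}) as outer, then df3 ({2, 3}) as right outer.
surviving = chain_keys({1, 2}, [({1}, "outer"), ({2, 3}, "right")])
```

With a single `cogroup(gdf2, gdf3)` call there is only one place to specify a join type, whereas the chained form gives each step its own.
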
   
   Thoughts?
   
   
