d80tb7 commented on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe 
Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-510663626
 
 
   Thanks @BryanCutler for the review, very helpful!
   
   As this started off as a PoC, I think there's quite a lot going on here (and 
the branch name is now a bit misleading).  Do you have any objections if I move 
this into a more appropriately named branch and raise a new PR?  This will also 
give me a chance to tidy up some of the Scala code, which certainly needs it.
   
   I've been following the discussions regarding changes to the pandas_udfs and 
I concur that it should (hopefully) only require the pandas_udf declaration to 
change.
   
   One thing that I would like to decide before going much further is the 
public API.  We discussed this on the Jira and I don't think we reached a 
consensus.  I would like to proceed based on what we have in this pull request, 
i.e. `df1.groupby('id').cogroup(df2.groupby('id')).apply(func)`, but if people 
feel strongly about the alternatives we suggested then I am open to 
change.  My reasoning here is that we can change things like using/not using 
Arrow stream in follow-up work, but the public API will be pretty much fixed 
once this is merged in.  @HyukjinKwon @icexelloss do you agree?
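   
   To make the proposed call shape concrete, here is a rough pure-Python sketch 
of the semantics I have in mind for `cogroup(...).apply(func)`.  It is not the 
Spark or Arrow implementation: rows are plain dicts, and `group_rows`, 
`cogroup_apply` and `count_both` are hypothetical names used only for 
illustration.  It also assumes that a key present on only one side is still 
passed to `func`, with an empty group for the other side.

```python
from collections import defaultdict


def group_rows(rows, key):
    """Group a list of row dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def cogroup_apply(left_rows, right_rows, key, func):
    """Call func(left_group, right_group) once per key, then concatenate.

    Assumption: keys found on only one side are still visited, and the
    missing side is passed as an empty list.
    """
    left = group_rows(left_rows, key)
    right = group_rows(right_rows, key)
    out = []
    for k in sorted(set(left) | set(right)):
        out.extend(func(left.get(k, []), right.get(k, [])))
    return out


def count_both(left_group, right_group):
    """Example user function: report how many rows each side has for a key."""
    k = (left_group or right_group)[0]["id"]
    return [{"id": k, "n_left": len(left_group), "n_right": len(right_group)}]


# Stand-ins for df1 and df2 in the proposed API.
df1 = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 2, "v": 20}]
df2 = [{"id": 1, "w": 0.5}, {"id": 3, "w": 0.7}]

result = cogroup_apply(df1, df2, "id", count_both)
```

   In the real API `func` would receive a pandas DataFrame for each side 
rather than a list of dicts, but the per-key pairing is the part the public 
API fixes.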
   

