[GitHub] [spark] d80tb7 edited a comment on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl

GitBox Fri, 19 Jul 2019 03:38:40 -0700

d80tb7 edited a comment on issue #24981: [WIP][SPARK-27463][PYTHON] Support 
Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-513181811
 
 
   @hjoo Yes, the current implementation is effectively a full outer join: when 
only one side of the cogroup has a given key, an empty dataframe is passed for 
the other side.  I think this is a sensible starting point, as it's consistent 
with Dataset's cogroup, and empty dataframes can be handled inside the UDF as 
the user sees fit.  That said, I can see a need for supporting other 'join' 
semantics here; otherwise a large subset of UDFs will have to include 
boilerplate for the empty dataframe case, which will get a bit tiresome.  This 
should be fairly straightforward to add and should be backwards compatible (so 
long as we keep the default behaviour as 'full outer'), so I'd prefer to add it 
as a follow-up piece of work if possible.  Does that seem reasonable?
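
To make the semantics above concrete, here is a minimal plain-Python sketch, not the PR's implementation. Lists of row dicts stand in for dataframes, and the `cogroup` helper and its `how` parameter are hypothetical names used only for illustration; they are not part of any Spark API. With the default, a key present on only one side yields an empty group for the other side.

```python
from collections import defaultdict

# Hypothetical join modes for illustration; only the default mirrors the
# behaviour described in the comment above.
_MODES = {"full_outer", "inner", "left_outer", "right_outer"}


def cogroup(left, right, key, how="full_outer"):
    """Return a list of (key, left_rows, right_rows), one entry per key.

    `left` and `right` are lists of dicts standing in for dataframes.
    A side with no rows for a key is represented by an empty list.
    """
    if how not in _MODES:
        raise ValueError(f"unsupported how: {how!r}")

    left_groups = defaultdict(list)
    right_groups = defaultdict(list)
    for row in left:
        left_groups[row[key]].append(row)
    for row in right:
        right_groups[row[key]].append(row)

    result = []
    # Union of keys from both sides, sorted for deterministic output.
    for k in sorted(left_groups.keys() | right_groups.keys()):
        l_rows = left_groups.get(k, [])
        r_rows = right_groups.get(k, [])
        if how in ("inner", "left_outer") and not l_rows:
            continue  # these modes require the key on the left side
        if how in ("inner", "right_outer") and not r_rows:
            continue  # these modes require the key on the right side
        result.append((k, l_rows, r_rows))
    return result


left = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
right = [{"id": 2, "w": 200}, {"id": 3, "w": 300}]

full = cogroup(left, right, "id")              # default: full outer
inner = cogroup(left, right, "id", how="inner")
```

Under the default, a user function has to cope with an empty side for keys 1 and 3; with the hypothetical `how="inner"` it would only ever see key 2, which is the boilerplate saving the comment refers to.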


