d80tb7 edited a comment on issue #24981: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs- Arrow Stream Impl
URL: https://github.com/apache/spark/pull/24981#issuecomment-513181811

@hjoo Yes, the current implementation is effectively a full outer join: if only one side of the cogroup has a matching key, an empty dataframe is passed for the other side. I think this is a sensible starting point, as it's consistent with Dataset's cogroup, and empty dataframes can be handled inside the udf as the user sees fit.

That said, I can see a need for supporting other 'join' semantics here. Otherwise a large subset of udfs will have to include boilerplate for handling the empty dataframe case, and that will get a bit tiresome. This should be fairly straightforward to add and should be backwards compatible (so long as we keep the default behaviour as 'full outer'), so I'd prefer to add it as a follow-up piece of work if possible. Does that seem reasonable?
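To make the semantics concrete, here is a rough sketch in plain pandas rather than the code in this PR. The helper `cogroup_apply`, its `how` parameter, and the `summarize` function are all hypothetical names made up for illustration. It mimics the 'full outer' behaviour described above (an empty dataframe for the missing side) and shows how an optional 'inner' mode could sit next to it while keeping 'full outer' as the default.

```python
import pandas as pd


def cogroup_apply(left, right, key, func, how="full_outer"):
    """Hypothetical helper (not part of the PR): call func(left_group, right_group)
    once per key, passing an empty DataFrame for a side that lacks the key."""
    left_groups = {k: g for k, g in left.groupby(key)}
    right_groups = {k: g for k, g in right.groupby(key)}
    if how == "full_outer":
        keys = set(left_groups) | set(right_groups)  # keys present on either side
    elif how == "inner":
        keys = set(left_groups) & set(right_groups)  # keys present on both sides
    else:
        raise ValueError(f"unsupported how: {how!r}")
    results = []
    for k in sorted(keys):
        l = left_groups.get(k, left.iloc[0:0])    # empty frame, same columns
        r = right_groups.get(k, right.iloc[0:0])
        results.append(func(l, r))
    return pd.concat(results, ignore_index=True)


def summarize(l, r):
    # The kind of boilerplate the comment mentions: the function has to
    # cope with either side being empty when 'full outer' semantics apply.
    key = l["id"].iloc[0] if not l.empty else r["id"].iloc[0]
    return pd.DataFrame({"id": [key], "left_rows": [len(l)], "right_rows": [len(r)]})


left = pd.DataFrame({"id": [1, 1, 2], "v1": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"id": [2, 3], "v2": [10.0, 20.0]})

full = cogroup_apply(left, right, "id", summarize)            # default: full outer
inner = cogroup_apply(left, right, "id", summarize, how="inner")
```

With the default, `summarize` is called for ids 1, 2 and 3, and sees an empty frame for whichever side is missing. With `how="inner"` it is called only for id 2, so a function written for that mode would not need the empty-frame check.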
