[GitHub] [spark] d80tb7 opened a new pull request #24965: [WIP][SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs

GitBox Tue, 25 Jun 2019 03:08:49 -0700

d80tb7 opened a new pull request #24965: [WIP][SPARK-27463][PYTHON] Support 
Dataframe Cogroup via Pandas UDFs
URL: https://github.com/apache/spark/pull/24965
 
 
   This is a rough first cut of a Pandas UDF cogroup implementation. Currently 
implemented:
   
   - JVM serialisation for interleaved dataframes.
   - Python deserialisation for interleaved dataframes.
   - A skeleton cogroup implementation.
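
   For readers unfamiliar with cogroup, the semantics can be sketched roughly 
as below. This is not the code in this PR: it uses plain Python lists of dicts 
in place of pandas DataFrames, and the `cogroup` helper and the column names 
are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def cogroup(left, right, key):
    """Group two row lists by `key` and pair up the groups (hypothetical helper)."""
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for row in left:
        left_groups[row[key]].append(row)
    for row in right:
        right_groups[row[key]].append(row)
    # A key present on only one side is paired with an empty group.
    for k in sorted(left_groups.keys() | right_groups.keys()):
        yield k, left_groups.get(k, []), right_groups.get(k, [])


left = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 2, "v": 20}]
right = [{"id": 1, "w": 0.5}, {"id": 3, "w": 0.7}]

# The per-pair function stands in for the user-supplied UDF; here it
# just counts the rows on each side of every key.
result = {k: (len(l), len(r)) for k, l, r in cogroup(left, right, "id")}
```

   The point is that the user function sees both groups for a key at once, 
rather than one group at a time as in a grouped map.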
   
   The code is still pretty rough, with the main caveats being:
   
   - The data passing is pretty minimal (e.g. it only supports exactly two 
dataframes, and there's no ability to distinguish on the Python side between 
key and value columns, etc.).
   - The cogroup implementation doesn't work properly when grouping by a 
string, because attribute resolution fails.
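
   To make the "exactly two dataframes" limitation concrete, here is one way 
an interleaved stream could look. This is an assumption about what 
"interleaved" means, not a description of the actual wire format: the sketch 
does not model Arrow at all, the groups are plain lists, and `interleave` and 
`read_pairs` are hypothetical names.

```python
_MISSING = object()  # sentinel marking a stream that ended mid-pair


def interleave(left_groups, right_groups):
    """Writer side: flatten paired groups into left, right, left, right, ..."""
    for left, right in zip(left_groups, right_groups):
        yield left
        yield right


def read_pairs(stream):
    """Reader side: consume the stream two items at a time."""
    it = iter(stream)
    for left in it:
        right = next(it, _MISSING)
        if right is _MISSING:
            raise ValueError("stream ended with an unpaired left group")
        yield left, right


lefts = [["a1", "a2"], ["b1"]]
rights = [["x1"], []]

stream = list(interleave(lefts, rights))
pairs = list(read_pairs(stream))
```

   In a scheme like this the reader relies purely on position to tell left 
from right, which is why it would be limited to exactly two inputs and would 
carry no information about which columns are keys.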
   
   At this point I think I'd like to focus on:
   
   - Does the data passing mechanism (i.e. the deviation from Arrow 
streaming) make sense?
   - If we are going to introduce such a data passing mechanism, how complex 
should it be?
   - Does the high-level implementation of the cogroup here make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
