[jira] [Created] (SPARK-20396) Add support for pandas udf in pyspark

Li Jin (JIRA) Wed, 19 Apr 2017 10:37:01 -0700

Li Jin created SPARK-20396:
------------------------------

             Summary: Add support for pandas udf in pyspark
                 Key: SPARK-20396
                 URL: https://issues.apache.org/jira/browse/SPARK-20396
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Li Jin



split-apply-merge is a common pattern when analyzing data. It is implemented in 
many popular data analyzing libraries such as Spark, Pandas, R, and etc. Split 
and merge operations in these libraries are similar to each other, mostly 
implemented by certain grouping operators. For instance, Spark DataFrame has 
groupBy, Pandas DataFrame has groupby. Therefore, for users familiar with 
either Spark DataFrame or pandas DataFrame, it is not difficult for them to 
understand how grouping works in the other library. However, apply is more 
native to different libraries and therefore, quite different between libraries. 
A pandas user knows how to use apply to do curtain transformation in pandas 
might not know how to do the same using pyspark. Also, the current 
implementation of passing data from the java executor to python executor is not 
efficient, there is opportunity to speed it up using Apache Arrow. This feature 
can enable use cases that uses Spark's grouping operators such as groupBy, 
rollUp, cube, window and Pandas's native apply operator.

Related work:

SPARK-13534
This enables faster data serialization between Pyspark and Pandas using Apache 
Arrow. Our work will be on top of this and use the same serialization for 
pandas udf.

SPARK-12919 and SPARK-12922
These implemented two functions: dapply and gapply in Spark R which implements 
the similar split-apply-merge pattern that we want to implement with Pyspark. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-20396) Add support for pandas udf in pyspark

Reply via email to