[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...

ueshin Wed, 06 Sep 2017 05:17:11 -0700

GitHub user ueshin opened a pull request:

    https://github.com/apache/spark/pull/19147


    [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Python

    ## What changes were proposed in this pull request?
    
    This PR introduces vectorized UDFs in Python.
    Note that this PR is meant to cover only the APIs for vectorized UDFs, not
    the APIs for vectorized UDAFs or Window operations.
    
    **Proposed API**
    
    We introduce a `@pandas_udf` decorator (or annotation) to define vectorized
    UDFs. A vectorized UDF takes one or more `pandas.Series` as arguments or,
    for a 0-parameter UDF, a single integer giving the length of the input.
    It should return a `pandas.Series` of the specified type whose length is
    the same as the length of the input.
    
    We can define vectorized UDFs as:
    
    ```python
      @pandas_udf(DoubleType())
      def plus(v1, v2):
          return v1 + v2
    ```
    
    Or, equivalently, without the decorator syntax:
    
    ```python
      plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
    ```
    
    It can be used in the same way as a row-by-row UDF:
    
    ```python
      df.withColumn('sum', plus(df.v1, df.v2))
    ```
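
To illustrate what "vectorized" means for the `plus` example, here is a minimal sketch in plain pandas, without Spark. It assumes the UDF body is called once with whole `pandas.Series` objects, as described above. The name `plus_row` and the sample values are hypothetical and are not part of this PR.

```python
import pandas as pd


def plus(v1: pd.Series, v2: pd.Series) -> pd.Series:
    # Vectorized body: a single call handles the whole batch of values.
    return v1 + v2


def plus_row(a: float, b: float) -> float:
    # Row-by-row body (hypothetical, for comparison): one call per input row.
    return a + b


# Hypothetical batches standing in for the columns df.v1 and df.v2.
v1 = pd.Series([1.0, 2.0, 3.0])
v2 = pd.Series([10.0, 20.0, 30.0])

vectorized = plus(v1, v2)
row_by_row = pd.Series([plus_row(a, b) for a, b in zip(v1, v2)])

# Both styles produce the same values; the output has one value per input row.
assert vectorized.equals(row_by_row)
assert len(vectorized) == len(v1)
```

Both forms compute the same column. The vectorized form simply does it with one Python call per batch instead of one call per row.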
    
    A 0-parameter UDF can be defined and used as:
    
    ```python
      import pandas as pd

      @pandas_udf(LongType())
      def f0(size):
          # `size` is the length of the input for this call
          return pd.Series(1).repeat(size)
    
      df.select(f0())
    ```
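
The length requirement on the return value can be sketched in plain pandas as well. The helper `check_length` below is hypothetical: it is only an assumption about what such a check might look like, not the check added in this PR. The body of `f0` is the one from the example above.

```python
import pandas as pd


def check_length(result: pd.Series, expected: int) -> pd.Series:
    # Hypothetical helper: reject a result whose length differs from the input length.
    if len(result) != expected:
        raise ValueError(f"expected {expected} values, got {len(result)}")
    return result


def f0(size: int) -> pd.Series:
    # 0-parameter UDF body: it receives only the length of the input.
    return pd.Series(1).repeat(size)


# A result of the right length passes the check unchanged.
batch = check_length(f0(4), 4)

# A body that ignores the size hint returns the wrong length and is rejected.
try:
    check_length(pd.Series([1]), 4)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Note that `Series.repeat` also repeats the index labels, so `f0(4)` has four values that all carry the index label 0. Only the length matters for the check sketched here.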
    
    ## How was this patch tested?
    
    Added new tests; the change is also covered by existing tests.
    
    ## TBD
    
    - [ ] How to pass the size hint to 0-parameter UDFs (and possibly to UDFs
    with parameters as well, for consistency)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-21990/vectorizedudfs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19147
    
----
commit a2a3f8205394aa3d48a22972d00263897e0035b1
Author: Takuya UESHIN <ues...@databricks.com>
Date:   2017-08-23T09:05:05Z

    Introduce vectorized UDF in Python.

commit a1e4f62c993cc4c35523ed947178a72a2aadb753
Author: Takuya UESHIN <ues...@databricks.com>
Date:   2017-09-06T04:46:21Z

    Add check if the length of returned value is the same as input value.

----


