GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20237
[SPARK-22980][PYTHON][SQL] Clarify the length of each series is of each
batch within scalar Pandas UDF
## What changes were proposed in this pull request?
This PR proposes to add a note saying that the length of a scalar Pandas
UDF's `Series` is not the length of the whole input column but that of each batch.
This is not an issue for a grouped map UDF, because its usage differs from a
typical UDF, but scalar Pandas UDFs can easily be confused with normal UDFs.
For example, consider the following:
```python
from pyspark.sql.functions import pandas_udf, col, lit
from pyspark.sql.types import LongType

df = spark.range(1)
f = pandas_udf(lambda x, y: len(x) + y, LongType())
df.select(f(lit('text'), col('id'))).show()
```
```
+------------------+
|<lambda>(text, id)|
+------------------+
| 1|
+------------------+
```
```python
from pyspark.sql.functions import udf, col, lit
df = spark.range(1)
f = udf(lambda x, y: len(x) + y, "long")
df.select(f(lit('text'), col('id'))).show()
```
```
+------------------+
|<lambda>(text, id)|
+------------------+
| 4|
+------------------+
```
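To see why the scalar Pandas UDF returns 1 above while the normal UDF returns 4, it can help to picture how Spark feeds data to each kind of function. The sketch below is a plain-Python simulation, not Spark code: `run_as_scalar_pandas_udf`, its batch size, and the list-based "columns" are all invented for illustration. The point it demonstrates is the one the added note makes: the function receives one batch at a time, so `len(x)` is the size of the current batch, not the total row count.

```python
def run_as_scalar_pandas_udf(func, col_x, col_y, batch_size):
    """Simulate a scalar Pandas UDF: apply `func` to whole batches of rows.

    In real Spark, each batch would be a pandas Series; here we use plain
    lists to keep the sketch dependency-free.
    """
    result = []
    for i in range(0, len(col_x), batch_size):
        bx = col_x[i:i + batch_size]   # one batch of column x
        by = col_y[i:i + batch_size]   # matching batch of column y
        # func sees the entire batch, so len(bx) is the batch size
        result.extend(func(bx, by))
    return result


# Batch-wise analogue of `lambda x, y: len(x) + y`:
f = lambda x, y: [len(x) + v for v in y]

xs = ['text'] * 5
ys = list(range(5))

# With batch_size=2 the batches have sizes 2, 2, 1, so len(x) varies:
print(run_as_scalar_pandas_udf(f, xs, ys, batch_size=2))
# -> [2, 3, 4, 5, 5]

# A normal (row-at-a-time) UDF sees scalars, so len(x) is always 4:
print([len(x) + y for x, y in zip(xs, ys)])
# -> [4, 5, 6, 7, 8]
```

With only one row, as in the example above, the single batch has length 1, which is why the scalar Pandas UDF prints 1 where the normal UDF prints `len('text')`, i.e. 4.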
## How was this patch tested?
Manually built the doc and checked the output.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-22980
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20237.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20237
----
commit d2cfed308d343fb55c5fd7c0d30bcbb987948632
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-11T15:31:05Z
Clarify the length of each series is of each batch within scalar Pandas UDF
----