GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20588

    [SPARK-23352][PYTHON][BRANCH-2.3] Explicitly specify supported types in Pandas UDFs

    This PR backports https://github.com/apache/spark/pull/20531 to branch-2.3.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark PR_TOOL_PICK_PR_20531_BRANCH-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20588.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20588
    
----
commit 44fe840186af03cfc31bed746524e9f4e01bd7a2
Author: hyukjinkwon <gurwls223@...>
Date:   2018-02-12T11:49:36Z

    [SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs
    
    This PR explicitly specifies the supported types in Pandas UDFs.
    The main change is to add deduplicated, explicit type checking of `returnType` up front and to document it; along the way, it also fixes several things.
    
    1. Currently, `BinaryType` is not supported in Pandas UDFs. For example:
    
        ```python
        from pyspark.sql.functions import pandas_udf
        pudf = pandas_udf(lambda x: x, "binary")
        df = spark.createDataFrame([[bytearray(1)]])
        df.select(pudf("_1")).show()
        ```
        ```
        ...
        TypeError: Unsupported type in conversion to Arrow: BinaryType
        ```
    
        We can document this behaviour in the guide.
    
    2. Also, the grouped aggregate Pandas UDF fails fast on `ArrayType`, but it seems we can support this case.
    
        ```python
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        foo = pandas_udf(lambda v: v.mean(), 'array<double>', PandasUDFType.GROUPED_AGG)
        df = spark.range(100).selectExpr("id", "array(id) as value")
        df.groupBy("id").agg(foo("value")).show()
        ```
    
        ```
        ...
        NotImplementedError: ArrayType, StructType and MapType are not supported with PandasUDFType.GROUPED_AGG
        ```
    
    3. Since we can check the return type ahead of time, we can fail fast before actual execution.
    
        ```python
        from pyspark.sql.functions import pandas_udf
        from pyspark.sql.types import BinaryType
        # We can fail fast at this stage because the return type is known up front.
        pandas_udf(lambda x: x, BinaryType())
        ```
    
    Manually tested; unit tests for `BinaryType` and `ArrayType(...)` were added.
    
    Author: hyukjinkwon <gurwls...@gmail.com>
    
    Closes #20531 from HyukjinKwon/pudf-cleanup.
    
    (cherry picked from commit c338c8cf8253c037ecd4f39bbd58ed5a86581b37)
    Signed-off-by: hyukjinkwon <gurwls...@gmail.com>

----
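
The fail-fast idea from item 3 of the commit message can be illustrated without Spark. The following is a minimal sketch only, not PySpark's implementation: `PandasUdfSketch`, `UNSUPPORTED_RETURN_TYPES`, and the error text are hypothetical names made up for illustration, and return types are plain strings rather than Spark `DataType` objects. The only fact taken from the commit message is that `BinaryType` is treated as unsupported.

```python
class PandasUdfSketch:
    """Hypothetical stand-in for a UDF wrapper; not a PySpark class."""

    # Illustrative only: the commit message names BinaryType as unsupported.
    UNSUPPORTED_RETURN_TYPES = frozenset({"binary"})

    def __init__(self, func, return_type):
        # Validate when the UDF is defined, before any data is processed.
        self.return_type = self._check_return_type(return_type)
        self.func = func

    @classmethod
    def _check_return_type(cls, return_type):
        # One shared check, so every code path rejects the same types.
        normalized = return_type.strip().lower()
        if normalized in cls.UNSUPPORTED_RETURN_TYPES:
            raise NotImplementedError(
                "Invalid returnType in this sketch: %s is not supported" % normalized
            )
        return normalized

    def __call__(self, values):
        return self.func(values)


# A supported type is accepted and the wrapped function runs as usual.
double_udf = PandasUdfSketch(lambda xs: [x * 2 for x in xs], "double")
doubled = double_udf([1.0, 2.0])

# An unsupported type is rejected at definition time, with no data involved.
try:
    PandasUdfSketch(lambda xs: xs, "binary")
    failed_fast = False
except NotImplementedError:
    failed_fast = True
```

Keeping the check in a single place and running it in the constructor is what makes the error surface at definition time rather than deep inside execution.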

