GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20531
[SPARK-23352][PYTHON] Explicitly specify supported types in Pandas UDFs
## What changes were proposed in this pull request?
This PR targets to explicitly specify supported types in Pandas UDFs.
The main change here is to add a deduplicated and explicit type checking in
`returnType` ahead with documenting this; however, it happened to fix multiple
things.
1. Currently, we don't support `BinaryType` in Pandas UDFs, for example,
see:
```python
from pyspark.sql.functions import pandas_udf
pudf = pandas_udf(lambda x: x, "binary")
df = spark.createDataFrame([[bytearray("a")]])
df.select(pudf("_1")).show()
```
```
...
TypeError: Unsupported type in conversion to Arrow: BinaryType
```
We can document this behaviour for its guide.
2. Also, the grouped aggregate Pandas UDF fail fast on `ArrayType` but
seems we can support this case.
```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
foo = pandas_udf(lambda v: v.mean(), 'array<double>',
PandasUDFType.GROUPED_AGG)
df = spark.range(100).selectExpr("id", "array(id) as value")
df.groupBy("id").agg(foo("value")).show()
```
```
...
NotImplementedError: ArrayType, StructType and MapType are not
supported with PandasUDFType.GROUPED_AGG
```
3. Since we can check the return type ahead, we can fail fast before actual
execution.
```python
# we can fail fast at this stage because we know the schema ahead
pandas_udf(lambda x: x, BinaryType())
```
## How was this patch tested?
Manually tested and unit tests for `BinaryType` and `ArrayType(...)` were
added.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark pudf-cleanup
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20531.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20531
----
commit ec708d58001be1382cabbe4357cbc68e2d51a8b6
Author: hyukjinkwon <gurwls223@...>
Date: 2018-02-07T00:18:20Z
Explicitly specify supported types with Pandas UDFs
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]