GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/19147
[WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Python

## What changes were proposed in this pull request?

This PR introduces vectorized UDFs in Python.

Note that this PR is meant to focus on APIs for vectorized UDFs, not APIs for vectorized UDAFs or Window operations.

**Proposed API**

We introduce a `@pandas_udf` decorator (or annotation) to define vectorized UDFs. A vectorized UDF takes one or more `pandas.Series` as input or, for a 0-parameter UDF, a single integer giving the length of the input. The return value should be a `pandas.Series` of the specified type, and its length should be the same as that of the input.

We can define vectorized UDFs as:

```python
@pandas_udf(DoubleType())
def plus(v1, v2):
    return v1 + v2
```

or as:

```python
plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
```

We can use them in the same way as row-by-row UDFs:

```python
df.withColumn('sum', plus(df.v1, df.v2))
```

A 0-parameter UDF can be defined and used as:

```python
@pandas_udf(LongType())
def f0(size):
    return pd.Series(1).repeat(size)

df.select(f0())
```

## How was this patch tested?

Added tests and existing tests.

## TBD

- [ ] how to specify the size hint for 0-parameter UDFs (or for UDFs with more parameters, to be consistent)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-21990/vectorizedudfs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19147

----
commit a2a3f8205394aa3d48a22972d00263897e0035b1
Author: Takuya UESHIN <ues...@databricks.com>
Date:   2017-08-23T09:05:05Z

    Introduce vectorized UDF in Python.

commit a1e4f62c993cc4c35523ed947178a72a2aadb753
Author: Takuya UESHIN <ues...@databricks.com>
Date:   2017-09-06T04:46:21Z

    Add check if the length of returned value is the same as input value.

----
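
The length contract in the proposed API (the returned value must be as long as the input, and a 0-parameter UDF receives only the batch length) can be illustrated without Spark. The following is a minimal sketch, not the code in this PR: plain Python lists stand in for `pandas.Series`, and `call_vectorized` and `bad` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence


def call_vectorized(func: Callable[..., Sequence], *columns: Sequence, size: int = 0) -> List:
    """Hypothetical helper: call `func` once on whole columns and check the result length.

    Lists stand in for pandas.Series here; this is an assumption made to keep
    the sketch dependency-free, not how Spark invokes the UDF.
    """
    if columns:
        expected = len(columns[0])
        if any(len(c) != expected for c in columns):
            raise ValueError("all input columns must have the same length")
        result = func(*columns)
    else:
        # 0-parameter case: the function receives only the batch length.
        expected = size
        result = func(size)
    if len(result) != expected:
        raise ValueError(
            "returned length %d does not match input length %d" % (len(result), expected)
        )
    return list(result)


def plus(v1, v2):
    # Element-wise sum over whole columns (with pandas this would be v1 + v2).
    return [a + b for a, b in zip(v1, v2)]


def f0(size):
    # 0-parameter UDF: builds a constant column of the requested length.
    return [1] * size


def bad(v1):
    # Violates the contract by dropping the last row.
    return v1[:-1]
```

With these definitions, `call_vectorized(plus, [1.0, 2.0], [3.0, 4.0])` returns `[4.0, 6.0]`, `call_vectorized(f0, size=3)` returns `[1, 1, 1]`, and `call_vectorized(bad, [1, 2, 3])` raises `ValueError` because the result is one element short.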