[
https://issues.apache.org/jira/browse/SPARK-53672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036639#comment-18036639
]
Dongjoon Hyun commented on SPARK-53672:
---------------------------------------
We can add more new-feature subtasks until mid-November, and more bug-fix
subtasks until the official Apache Spark 4.1.0 release.
> Unified interface for UDF
> -------------------------
>
> Key: SPARK-53672
> URL: https://issues.apache.org/jira/browse/SPARK-53672
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark
> Affects Versions: 4.1.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 4.1.0
>
>
> We propose extending @udf to accept more input types and return types, in
> order to promote vectorized UDFs.
>
> Vectorized UDFs (based on Pandas and PyArrow) are much more performant than
> normal Python UDFs, but people are still using the normal UDFs in most cases.
>
>
> Existing normal UDFs take Python objects (e.g. bool/int/float/dict) and Rows
> (for StructType) as inputs and outputs, for example:
> {code:java}
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
>
> @udf(returnType=IntegerType())
> def calc(a: int, b: int) -> int:
>     return a + 10 * b{code}
>
> In vectorized UDFs, by contrast, the type hints are used to infer the
> evaluation type.
>
> Proposal: Allow Pandas/PyArrow objects (pd.Series/pd.DataFrame/pa.Array/etc.)
> as input types and return types, and try to infer the corresponding
> evaluation type from the signature and type hints. If the type hints match
> one of the patterns of supported vectorized UDFs (e.g. the Series to Series
> type in
> [pandas_udf|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html?highlight=pandas_udf#pyspark.sql.functions.pandas_udf]),
> then the UDF is treated as a vectorized UDF.
>
>
> Example:
> {code:java}
> import pandas as pd
>
> @udf(returnType=IntegerType())
> def calc(a: pd.Series, b: pd.Series) -> pd.Series:
>     return a + 10 * b{code}
> will be converted to
> {code:java}
> import pandas as pd
>
> @pandas_udf(returnType=IntegerType(), functionType=PandasUDFType.SCALAR)
> def calc(a: pd.Series, b: pd.Series) -> pd.Series:
>     return a + 10 * b{code}
> The inferred UDF type here is PandasUDFType.SCALAR.
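
The inference step in the proposal above could be sketched roughly as follows.
This is only an illustration, not the Spark implementation: the helper
infer_eval_type, the stand-in Series class (used in place of pd.Series so the
snippet runs without pandas), and the string labels are all hypothetical.

```python
import inspect
from typing import get_type_hints


class Series:
    """Hypothetical stand-in for pd.Series; not the real pandas class."""


def infer_eval_type(func, vectorized_types=(Series,)):
    """Guess an evaluation type from a function's type hints.

    Returns "SERIES_TO_SERIES" when every parameter and the return value
    are annotated with a vectorized type, and "REGULAR" otherwise.
    Both labels are made up for this sketch.
    """
    hints = get_type_hints(func)
    params = list(inspect.signature(func).parameters)
    if not params or "return" not in hints:
        return "REGULAR"
    # Collect the annotation of each parameter plus the return annotation;
    # an unannotated parameter yields None and fails the check below.
    annotated = [hints.get(name) for name in params] + [hints["return"]]
    if all(isinstance(h, type) and issubclass(h, vectorized_types)
           for h in annotated):
        return "SERIES_TO_SERIES"
    return "REGULAR"


def calc_plain(a: int, b: int) -> int:
    return a + 10 * b


def calc_vectorized(a: Series, b: Series) -> Series:
    return a  # body is irrelevant here; only the signature is inspected


plain_kind = infer_eval_type(calc_plain)
vectorized_kind = infer_eval_type(calc_vectorized)
```

Only the signature is inspected, so a function whose hints do not match a
known vectorized pattern falls back to the regular UDF path, which is the
behavior the description asks for.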