[ 
https://issues.apache.org/jira/browse/SPARK-53672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036639#comment-18036639
 ] 

Dongjoon Hyun commented on SPARK-53672:
---------------------------------------

We can add more new feature subtasks until the mid of November and more bug fix 
subtasks until the official Apache Spark 4.1.0 release.

> Unified interface for UDF
> -------------------------
>
>                 Key: SPARK-53672
>                 URL: https://issues.apache.org/jira/browse/SPARK-53672
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 4.1.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>             Fix For: 4.1.0
>
>
> We would extend @udf by allowing more input types and return types to promote 
> vectorized UDFs.
>  
> Vectorized UDFs (based on Pandas and PyArrow) are much more performant than 
> normal Python UDFs, but people are still using the normal UDFs in most cases.
>  
>  
> Existing normal UDFs take Python objects (e.g. bool/int/float/dict) and Rows 
> (for StructType) as inputs and outputs, for example:
> {code:java}
> @udf(returnType=IntegerType())
> def calc(a: int, b: int) -> int:
>     return a + 10 * b{code}
>  
> While in vectorized UDFs, the type hints are used to infer the evaluation 
> types.
>  
> Proposal: Allow Pandas/PyArrow objects (pd.Series/pd.DataFrame/pa.Array/etc) 
> as input types and return types, try to infer the corresponding evaluation 
> type based on the signature and type hints. If the type hints match one of 
> the patterns of supported vectorized UDFs (e.g. the Series to Series type in 
> [pandas_udf|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html?highlight=pandas_udf#pyspark.sql.functions.pandas_udf]),
>  then the UDF is treated as a vectorized UDF.
>  
>  
> Example:
> {code:java}
> import pandas as pd
>  
> @udf(returnType=IntegerType())
> def calc(a: pd.Series, b: pd.Series) -> pd.Series:
>     return a + 10 * b{code}
> will be converted to
> {code:java}
> import pandas as pd
>  
> @pandas_udf(returnType=IntegerType(), PandasUDFType.SCALAR)
> def calc(a: pd.Series, b: pd.Series) -> pd.Series:
>     return a + 10 * b{code}
> The inferred UDF type here is PandasUDFType.SCALAR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to