[ 
https://issues.apache.org/jira/browse/SPARK-27775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27775:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Support multiple return values for udf
> --------------------------------------
>
>                 Key: SPARK-27775
>                 URL: https://issues.apache.org/jira/browse/SPARK-27775
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Xianjin YE
>            Priority: Major
>
> Hi, I'd like to propose an improvement to Spark SQL: multi-alias support for 
> UDFs, inspired by one of our internal SQL systems.
>  
> Currently, neither Spark SQL nor Hive supports multiple return values from a 
> single UDF. One alternative is to return a StructType from the UDF and then 
> select the corresponding fields. That approach has two downsides:
>  * The SQL is more complex than with multi-alias and becomes hard to read 
> when several similar UDFs are involved.
>  * The UDF code is evaluated multiple times, once per projected field.
>  
> For example, suppose a UDF is defined as below:
> {code:java}
> // Scala
> import org.apache.spark.sql.functions.udf
>
> def myFunc: (String => (String, String)) = { s =>
>   println("xx") // side effect makes each evaluation visible
>   (s.toLowerCase, s.toUpperCase)
> }
> val myUDF = udf(myFunc)
> {code}
> To select multiple fields from the result of myUDF, I have to write:
> {code:java}
> // Scala
> spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain()
> == Physical Plan ==
> *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 
> AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164]
> +- *(1) Range (0, 10, step=1, splits=48)
> {code}
> or 
> {code:java}
> // Scala
> spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from t1) t2").explain()
> == Physical Plan ==
> *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1 
> AS _1#155, UDF(cast(id#12L as string))._2 AS _2#156]
> +- *(1) Range (0, 10, step=1, splits=48)
> {code}
> In both cases, the UDF `myUDF` is evaluated twice, once for each projected field.
> If multi-alias were supported for UDFs that return a struct, we could simply 
> write the following and extract all return values with a single evaluation:
> {code:java}
> // Scala
> spark.sql("select id, myUDF(id) as (x, y) from t1")
> {code}
>  
> [SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] added multi-alias 
> support for UDTFs, but there is no equivalent support for UDFs. cc [~scwf] and 
> [~cloud_fan]
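
Until something like the proposed syntax exists, one possible stopgap for the double evaluation described in the quoted issue is to mark the UDF as nondeterministic and split the struct in an outer projection. This is only a sketch, not part of the proposal: the object name SingleEvalSketch, the helper splitOnce, the local[1] master, and the use of spark.range(10) as a stand-in for t1 are all assumptions made for illustration. It relies on the optimizer not merging two projections when the outer one reads a nondeterministic column of the inner one, which is how the Spark versions I am aware of behave; check the physical plan on your own version before depending on it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object SingleEvalSketch {
  // Same function as in the issue, without the println.
  val myFunc: String => (String, String) = s => (s.toLowerCase, s.toUpperCase)

  // Hypothetical helper: evaluates the UDF in an inner projection and
  // extracts the tuple fields in an outer one.
  def splitOnce(spark: SparkSession): DataFrame = {
    // asNondeterministic() is intended to keep the optimizer from
    // duplicating the UDF call into each field extraction.
    val myUDF = udf(myFunc).asNondeterministic()

    // Stand-in for table t1 from the issue (assumption).
    val t1 = spark.range(10).select(col("id").cast("string").as("id"))

    t1.select(col("id"), myUDF(col("id")).as("r"))
      .select(col("id"), col("r._1").as("x"), col("r._2").as("y"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("single-eval-sketch")
      .getOrCreate()

    val out = splitOnce(spark)
    out.explain() // inspect whether UDF(...) appears once or twice
    out.show()
    spark.stop()
  }
}
```

The trade-off is that a nondeterministic UDF also opts out of other optimizations, such as pushing filters past it, so this is a workaround rather than a substitute for real multi-alias support.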



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
