[
https://issues.apache.org/jira/browse/SPARK-27775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-27775:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> Support multiple return values for udf
> --------------------------------------
>
> Key: SPARK-27775
> URL: https://issues.apache.org/jira/browse/SPARK-27775
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Xianjin YE
> Priority: Major
>
> Hi, I'd like to proposal one improvement to Spark SQL, namely multi alias for
> udf, which is inspired by one of our internal SQL systems.
>
> Current Spark SQL and Hive don't support multiple return values for one udf.
> One alternative would be returning StructType for UDF, and then select
> corresponding fields. Two downsides about that approach:
> * The SQL is more complex than multi alias, quite unreadable for multiple
> similar UDFs.
> * the UDF code is evaluated multiple times, one time per Projection.
> for example, suppose one udf is defined as below:
> {code:java}
> // Scala
> def myFunc: (String => (String, String)) = { s => println("xx");
> (s.toLowerCase, s.toUpperCase)}
> val myUDF = udf(myFunc)
> {code}
> To select multiple fields of myUDF, I have to do:
> {code:java}
> // Scala
> spark.sql("select id, myUDF(id)._1, myUDF(id)._2 from t1").explain()
> == Physical Plan ==
> *(1) Project [cast(id#12L as string) AS id#14, UDF(cast(id#12L as string))._1
> AS UDF(id)._1#163, UDF(cast(id#12L as string))._2 AS UDF(id)._2#164]
> +- *(1) Range (0, 10, step=1, splits=48)
> {code}
> or
> {code:java}
> // Scala
> spark.sql("select id, id1._1, id1._2 from (select id, myUDF(id) as id1 from
> t1) t2").explain()
> == Physical Plan == *(1) Project [cast(id#12L as string) AS id#14,
> UDF(cast(id#12L as string))._1 AS _1#155, UDF(cast(id#12L as string))._2 AS
> _2#156] +- *(1) Range (0, 10, step=1, splits=48)
> {code}
> The udf `myUDF` has to be evaluated twice for two projection.
> If we can support multi alias for structure returned udf, we can simply do
> this, and extract multiple return values with only one evaluation of udf.
> {code:java}
> // Scala
> spark.sql("select id, myUDF(id) as (x, y) from t1"){code}
>
> [SPARK-5383|https://issues.apache.org/jira/browse/SPARK-5383] adds multi
> alias support for udtfs, the support for udfs is not. cc [~scwf] and
> [~cloud_fan]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]