I see. Thanks for the clarification. It's not a a big issue but I am surprised my UDF can be executed in planning phase. If my UDF is doing something expensive it could get weird.
On Fri, Jun 8, 2018 at 3:44 PM, Reynold Xin <r...@databricks.com> wrote: > But from the user's perspective, optimization is not run right? So it is > still lazy. > > > On Fri, Jun 8, 2018 at 12:35 PM Li Jin <ice.xell...@gmail.com> wrote: > >> Hi All, >> >> Sorry for the long email title. I am a bit surprised to find that the >> current optimizer rule "ConvertToLocalRelation" causes expressions to be >> eager-evaluated in planning phase, this can be demonstrated with the >> following code: >> >> scala> val myUDF = udf((x: String) => { println("UDF evaled"); "result" }) >> >> myUDF: org.apache.spark.sql.expressions.UserDefinedFunction = >> UserDefinedFunction(<function1>,StringType,Some(List(StringType))) >> >> >> scala> val df = Seq(("test", 1)).toDF("s", "i").select(myUDF($"s")) >> >> df: org.apache.spark.sql.DataFrame = [UDF(s): string] >> >> >> scala> println(df.queryExecution.optimizedPlan) >> >> UDF evaled >> >> LocalRelation [UDF(s)#9] >> >> This is somewhat unexpected to me because of Spark's lazy execution >> model. >> >> I am wondering if this behavior is by design? >> >> Thanks! >> Li >> >> >>