[ https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hagai Attias updated SPARK-26510:
---------------------------------
Description:

There seems to be a change of behaviour between Spark 1.6 and 2.3 when caching a DataFrame and registering it as a temp table. In 1.6, the following code executed {{printUDF}} once. The equivalent code in 2.3 (or even the same code as is) executes it twice.

{code:java|title=RegisterTest.scala|borderStyle=solid}
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)
val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")
sql.udf.register("printUDF", (x: Int) => {
  print(x)
  x
})
val df2 = sql.sql("select printUDF(num) result from data_table").cache()
df2.collect() // materialize the cache
df2.registerTempTable("cached_table")
val df3 = sql.sql("select result from cached_table")
df3.collect()
{code}

1.6 prints {{123}} while 2.3 prints {{123123}}, i.e. the DataFrame is evaluated twice. I managed to work around this by skipping the temporary table and selecting directly from the cached DataFrame, but I was wondering whether this is expected behaviour / a known issue.
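The workaround mentioned above can be sketched as follows. This is only an illustration, assuming the same {{sql}} context, {{printUDF}} registration, and {{data_table}} from the snippet above; it selects from the cached DataFrame directly instead of going through a second temp table.

{code:java|title=Workaround.scala|borderStyle=solid}
// Build and materialize the cached DataFrame as before.
val df2 = sql.sql("select printUDF(num) result from data_table").cache()
df2.collect() // materialize the cache; printUDF runs here

// Workaround: query df2 directly instead of registering "cached_table".
// The query is planned against the already-cached DataFrame, so
// printUDF is not re-evaluated.
val df3 = df2.select("result")
df3.collect()
{code}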
> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using
> 'createOrReplaceTempView'
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26510
>                 URL: https://issues.apache.org/jira/browse/SPARK-26510
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>            Reporter: Hagai Attias
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)