[ 
https://issues.apache.org/jira/browse/SPARK-26510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hagai Attias updated SPARK-26510:
---------------------------------
    Description: 
It seems there is a change of behavior between 1.6 and 2.3 when caching a 
DataFrame and registering it as a temp table. In 1.6, the following code 
executed {{printUDF}} once. The equivalent code in 2.3 (or even the same code 
as is) executes it twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", (x: Int) => {
  print(x)
  x
})

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() // materialize the cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect() // in 2.3, printUDF fires again here
{code}
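Note that {{registerTempTable}} is deprecated in Spark 2.x; the equivalent call (the one named in the issue title) is {{createOrReplaceTempView}}, which reproduces the same double evaluation here. A minimal sketch, reusing {{df2}} from the snippet above:

{code:java|title=CreateTempView.scala|borderStyle=solid}
// Spark 2.x replacement for the deprecated registerTempTable.
df2.createOrReplaceTempView("cached_table")

val df3 = sql.sql("select result from cached_table")
df3.collect() // printUDF is still evaluated a second time in 2.3
{code}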
 
1.6 prints {{123}} while 2.3 prints {{123123}}, i.e. the DataFrame is 
evaluated twice. I managed to work around this by skipping the temporary table 
and selecting directly from the cached DataFrame, but was wondering whether 
this is expected behavior / a known issue.
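For completeness, the workaround mentioned above can be sketched as follows (reusing {{df2}} from the snippet above; the column name {{result}} comes from the alias in the query):

{code:java|title=Workaround.scala|borderStyle=solid}
// Query the cached DataFrame directly instead of going through a
// temp table, so the cached plan is reused and printUDF only runs once.
val df3 = df2.select("result")
df3.collect()
{code}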
 

  was:
It seems there is a change of behavior between 1.6 and 2.3 when caching a 
DataFrame and registering it as a temp table. In 1.6, the following code 
executed {{printUDF}} once. The equivalent code in 2.3 (or even the same code 
as is) executes it twice.
 
{code:java|title=RegisterTest.scala|borderStyle=solid}
 
val rdd = context.parallelize(Seq(1, 2, 3)).map(Row(_))
val schema = StructType(StructField("num", IntegerType) :: Nil)

val df1 = sql.createDataFrame(rdd, schema)
df1.registerTempTable("data_table")

sql.udf.register("printUDF", new UDF().print _)

val df2 = sql.sql("select printUDF(num) result from data_table").cache()

df2.collect() //execute cache

df2.registerTempTable("cached_table")

val df3 = sql.sql("select result from cached_table")

df3.collect()
{code}
 
1.6 prints {{123}} while 2.3 prints {{123123}}, i.e. the DataFrame is 
evaluated twice. I managed to work around this by skipping the temporary table 
and selecting directly from the cached DataFrame, but was wondering whether 
this is expected behavior / a known issue.
 


> Spark 2.3 change of behavior (vs 1.6) when caching a dataframe and using 
> 'createOrReplaceTempView'
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26510
>                 URL: https://issues.apache.org/jira/browse/SPARK-26510
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>            Reporter: Hagai Attias
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
