[
https://issues.apache.org/jira/browse/SPARK-32046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203733#comment-17203733
]
Pablo Langa Blanco commented on SPARK-32046:
--------------------------------------------
When you cache a Dataframe, you do not save the name of the dataframe, what is
saved is a simplified (canonicalized) plan. Then, another different Dataframe,
with the same canonicalized plan will re-use the cached dataframe. In our
example:
{code:java}
scala> val df1 = spark.range(1).select(current_timestamp as "datetime")
scala> df1.queryExecution.analyzed.canonicalized
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Project [current_timestamp() AS #0]
+- Range (0, 1, step=1, splits=Some(4))
scala> val df2 = spark.range(1).select(current_timestamp as "datetime")
scala> df2.queryExecution.analyzed.canonicalized
Project [current_timestamp() AS #0]
+- Range (0, 1, step=1, splits=Some(4))
scala> df1.queryExecution.analyzed.canonicalized ==
df2.queryExecution.analyzed.canonicalized
res2: Boolean = true
scala> df2.explain(true)
== Optimized Logical Plan ==
Project [1601363033780000 AS datetime#6]
+- Range (0, 1, step=1, splits=Some(4))
scala> df1.cache
scala> df2.explain(true)
== Optimized Logical Plan ==
InMemoryRelation [datetime#6], true, 10000, StorageLevel(disk, memory,
deserialized, 1 replicas)
+- *(1) Project [1601363046569000 AS datetime#2]
+- *(1) Range (0, 1, step=1, splits=4)
{code}
What do you want to say with Java implementation? what is your actual work
arround? Just for understanding if I miss something
> current_timestamp called in a cache dataframe freezes the time for all future
> calls
> -----------------------------------------------------------------------------------
>
> Key: SPARK-32046
> URL: https://issues.apache.org/jira/browse/SPARK-32046
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0, 2.4.4, 3.0.0
> Reporter: Dustin Smith
> Priority: Minor
> Labels: caching, sql, time
>
> If I call current_timestamp 3 times while caching the dataframe variable in
> order to freeze that dataframe's time, the 3rd dataframe time and beyond
> (4th, 5th, ...) will be frozen to the 2nd dataframe's time. The 1st dataframe
> and the 2nd will differ in time but will become static on the 3rd usage and
> beyond (when running on Zeppelin or Jupyter).
> Additionally, caching only caused 2 dataframes to cache skipping the 3rd.
> However,
> {code:java}
> val df = Seq(java.time.LocalDateTime.now.toString).toDF("datetime").cache
> df.count
> // this can be run 3 times no issue.
> // then later cast to TimestampType{code}
> doesn't have this problem and all 3 dataframes cache with correct times
> displaying.
> Running the code in shell and Jupyter or Zeppelin (ZP) also produces
> different results. In the shell, you only get 1 unique time no matter how
> many times you run it, current_timestamp. However, in ZP or Jupyter I have
> always received 2 unique times before it froze.
>
> {code:java}
> val df1 = spark.range(1).select(current_timestamp as "datetime").cache
> df1.count
> df1.show(false)
> Thread.sleep(9500)
> val df2 = spark.range(1).select(current_timestamp as "datetime").cache
> df2.count
> df2.show(false)
> Thread.sleep(9500)
> val df3 = spark.range(1).select(current_timestamp as "datetime").cache
> df3.count
> df3.show(false){code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]