[
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai resolved SPARK-9141.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.5.0
Issue resolved by pull request 7920
[https://github.com/apache/spark/pull/7920]
> DataFrame recomputed instead of using cached parent.
> ----------------------------------------------------
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Nick Pritchard
> Assignee: Michael Armbrust
> Priority: Blocker
> Labels: cache, dataframe
> Fix For: 1.5.0
>
>
> As I understand, DataFrame.cache() is supposed to work the same as
> RDD.cache(), so that repeated operations on it will use the cached results
> and not recompute the entire lineage. However, it seems that some DataFrame
> operations (e.g. withColumn) change the underlying RDD lineage so that cache
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's
> that use println so that it is easy to see when they are being called. Next,
> I create a simple data frame with one row and two columns. Next, I add a
> column, cache it, and call count() to force the computation. Lastly, I add
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column,
> since everything else was cached. However, because withColumn() changes the
> lineage, the whole data frame is recomputed.
> {code:scala}
> // Examples udf's that println when called
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }
> // Initial dataset
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")
> // Add column by applying twice udf
> val df2 = df1.withColumn("twice", twice($"value"))
> df2.cache()
> df2.count() //prints Computed: twice(1)
> // Add column by applying triple udf
> val df3 = df2.withColumn("triple", triple($"value"))
> df3.cache()
> df3.count() //prints Computed: twice(1)\nComputed: triple(1)
> {code}
> I found a workaround, which helped me understand what was going on behind the
> scenes, but doesn't seem like an ideal solution. Basically, I convert to RDD
> then back DataFrame, which seems to freeze the lineage. The code below shows
> the workaround for creating the second data frame so cache will work as
> expected.
> {code:scala}
> val df2 = {
> val tmp = df1.withColumn("twice", twice($"value"))
> sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]