[
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649437#comment-14649437
]
Justin Uang commented on SPARK-9141:
------------------------------------
(Taken from spark dev email:
http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-rdd-doesn-t-respect-DataFrame-cache-slowing-down-CrossValidator-td13466.html)
If I call DataFrame#cache and DataFrame#count, and then get an rdd from
DataFrame#rdd, then the resulting RDD computes the entire lineage chain, not
using
cached DataFrame results.
Likewise, for the CrossValidation and LogisticRegression use case to work, we
would need this and the original of this bug to be fixed as well, such that if
I cached DataFrame A, then derive DataFrame B with DataFrame#withColumn, then
B.rdd() should return an RDD that uses the cached DataFrame A.
> DataFrame recomputed instead of using cached parent.
> ----------------------------------------------------
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Nick Pritchard
> Priority: Blocker
> Labels: cache, dataframe
>
> As I understand, DataFrame.cache() is supposed to work the same as
> RDD.cache(), so that repeated operations on it will use the cached results
> and not recompute the entire lineage. However, it seems that some DataFrame
> operations (e.g. withColumn) change the underlying RDD lineage so that cache
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's
> that use println so that it is easy to see when they are being called. Next,
> I create a simple data frame with one row and two columns. Next, I add a
> column, cache it, and call count() to force the computation. Lastly, I add
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column,
> since everything else was cached. However, because withColumn() changes the
> lineage, the whole data frame is recomputed.
> {code:scala}
> // Examples udf's that println when called
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }
> // Initial dataset
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value")
> // Add column by applying twice udf
> val df2 = df1.withColumn("twice", twice($"value"))
> df2.cache()
> df2.count() //prints Computed: twice(1)
> // Add column by applying triple udf
> val df3 = df2.withColumn("triple", triple($"value"))
> df3.cache()
> df3.count() //prints Computed: twice(1)\nComputed: triple(1)
> {code}
> I found a workaround, which helped me understand what was going on behind the
> scenes, but doesn't seem like an ideal solution. Basically, I convert to RDD
> then back DataFrame, which seems to freeze the lineage. The code below shows
> the workaround for creating the second data frame so cache will work as
> expected.
> {code:scala}
> val df2 = {
> val tmp = df1.withColumn("twice", twice($"value"))
> sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]