Hi, in Spark 2.4.0, unpersisting a DataFrame un-caches only the given
Dataset and then re-compiles the dependent cached queries after removing
the cached query, just like the question in
https://issues.apache.org/jira/browse/SPARK-21478.
When all the jobs are done and we unpersist the cached data, it can take a
long time to rebuild cached data that will never be used again. Take the
following code for example.
// Run in spark-shell, or with a SparkSession `spark` and `import spark.implicits._`
val x1 = Seq(1).toDF()
x1.persist()
val x2 = x1.select($"value" * 2)
x2.persist()
val x3 = x2.select($"value" * 2)
x3.persist()
x1.count()
x2.count()
x3.count()
...
// never used again, but re-compiles the dependent cached queries: x2, x3
x1.unpersist()
// never used again, but re-compiles the dependent cached query: x3
x2.unpersist()
// never used again
x3.unpersist()
So, can we expose the *cascade* parameter in the unpersist method and let
the user choose whether to rebuild the dependent caches or not?
def unpersist(blocking: Boolean): this.type = {
  sparkSession.sharedState.cacheManager.uncacheQuery(
    this, cascade = false, blocking)
  this
}
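
To make the request concrete, here is a rough sketch of the shape I have in mind. It is a toy model in plain Scala and not Spark code: Query, ToyCacheManager, ToyDataset and the recompiles counter are all made-up names for illustration, and the two-argument unpersist(blocking, cascade) is the proposed signature, not something that exists in Spark 2.4.0. It only counts how many dependent cached queries would be re-compiled under each setting.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a cached query: a name plus the names of the
// cached queries it was derived from (directly or transitively).
final case class Query(name: String, dependsOn: Set[String])

// Toy cache manager, NOT Spark's CacheManager. It only tracks which queries
// are cached and how many dependent queries had to be re-compiled.
final class ToyCacheManager {
  private val cached = mutable.LinkedHashMap.empty[String, Query]
  var recompiles: Int = 0

  def cache(q: Query): Unit = cached(q.name) = q
  def isCached(name: String): Boolean = cached.contains(name)

  def uncache(name: String, cascade: Boolean): Unit = {
    // Cached queries that were built on top of the one being removed.
    val dependents =
      cached.values.filter(_.dependsOn.contains(name)).map(_.name).toList
    cached.remove(name)
    if (cascade) dependents.foreach(d => cached.remove(d)) // drop them too
    else recompiles += dependents.size                     // keep and re-compile
  }
}

// Toy dataset wrapper showing the proposed (assumed) method shape.
final class ToyDataset(val query: Query, manager: ToyCacheManager) {
  def persist(): this.type = { manager.cache(query); this }

  // Proposed: the caller chooses cascade instead of it being hard-coded.
  // `blocking` is kept only to mirror the signature; the toy model ignores it.
  def unpersist(blocking: Boolean = false, cascade: Boolean = false): this.type = {
    manager.uncache(query.name, cascade)
    this
  }
}

object CascadeSketch {
  def main(args: Array[String]): Unit = {
    val manager = new ToyCacheManager
    val x1 = new ToyDataset(Query("x1", Set.empty), manager).persist()
    val x2 = new ToyDataset(Query("x2", Set("x1")), manager).persist()
    val x3 = new ToyDataset(Query("x3", Set("x1", "x2")), manager).persist()

    // One cascading call removes x1 together with everything built on it.
    x1.unpersist(cascade = true)
    println(s"cached: x2=${manager.isCached("x2")}, x3=${manager.isCached("x3")}")
    println(s"recompiles=${manager.recompiles}")
  }
}
```

With cascade = false the same three queries, unpersisted in the order x1, x2, x3 as in my example above, trigger re-compilation of x2 and x3 and then x3 again in this model, which is exactly the wasted work I would like to be able to skip.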