[
https://issues.apache.org/jira/browse/SPARK-29035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen resolved SPARK-29035.
----------------------------------
Resolution: Not A Problem
Your cache isn't doing anything, because you undo it before anything is
evaluated. Nothing is ignored here; caching is lazy, so you never caused
anything to be cached before you told Spark not to cache df.
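To illustrate the laziness point without Spark: {{cache()}} only marks a DataFrame for caching, the data is materialized the first time an action evaluates it, and {{unpersist()}} before any action simply clears that mark. A minimal pure-Python analogy of those semantics (the {{LazyFrame}} class below is a toy stand-in, not Spark's API):

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated, cacheable DataFrame."""

    def __init__(self, compute):
        self._compute = compute   # deferred work, like a Spark plan
        self._cache_requested = False
        self._cached = None       # materialized result, if any

    def cache(self):
        # Like DataFrame.cache(): only sets a mark; nothing runs yet.
        self._cache_requested = True
        return self

    def unpersist(self):
        # Clears both the mark and any materialized data.
        self._cache_requested = False
        self._cached = None
        return self

    def collect(self):
        # The "action": this is where the work actually happens.
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_requested:
            self._cached = result
        return result


calls = []
df = LazyFrame(lambda: calls.append("scan") or [1, 2, 3]).cache()
df.unpersist()        # undoes cache() before anything was evaluated
df.collect()
df.collect()
print(len(calls))     # prints 2: the source is re-scanned, nothing was cached
```

Calling {{unpersist()}} only after the last action that uses the cached data (here, after both {{collect()}} calls) is the analogue of the fix for the original code: the first action would then populate the cache and later actions would reuse it.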
> unpersist() ignoring cache/persist()
> ------------------------------------
>
> Key: SPARK-29035
> URL: https://issues.apache.org/jira/browse/SPARK-29035
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.3
> Environment: Amazon EMR - Spark 2.4.3
> Reporter: Jose Silva
> Priority: Major
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> Calling {{unpersist()}}, even though the {{DataFrame}} is not used anymore,
> removes all the InMemoryTableScan nodes from the DAG.
> Here's a simplified version of the code I'm using:
> {code}
> df = spark.read.load(...).where(...).cache()
> df_a = df.select(...).union(df.select(...)).union(df.select(...))
> df_b = df.select(...)
> df_c = df.select(...)
> df_d = df.select(...)
> df.unpersist()
> df_a.join(df_b, ...).join(df_c, ...).join(df_d, ...).write.save(...)
> {code}
> I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with
> and without the {{unpersist()}} call.
> I call unpersist in order to prevent an OOM during the join. From what I
> understand, even though all the DataFrames come from df, unpersisting df
> after doing the selects shouldn't cause the cache call to be ignored, right?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]