[ https://issues.apache.org/jira/browse/SPARK-29035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930297#comment-16930297 ]
Jose Silva commented on SPARK-29035:
------------------------------------

[~hyukjin.kwon] What do you mean by "full reproducer"?

> unpersist() ignoring cache/persist()
> ------------------------------------
>
>                 Key: SPARK-29035
>                 URL: https://issues.apache.org/jira/browse/SPARK-29035
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>         Environment: Amazon EMR - Spark 2.4.3
>            Reporter: Jose Silva
>            Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Calling {{unpersist()}}, even though the {{DataFrame}} is not used anymore,
> removes all the InMemoryTableScan nodes from the DAG.
> Here's a simplified version of the code I'm using:
> {code}
> df = spark.read(...).where(...).cache()
> df_a = union(df.select(...), df.select(...), df.select(...))
> df_b = df.select(...)
> df_c = df.select(...)
> df_d = df.select(...)
> df.unpersist()
> join(df_a, df_b, df_c, df_d).write()
> {code}
> I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with
> and without the {{unpersist()}} call.
> I call unpersist in order to prevent OOM during the join. From what I
> understand, even though all the DataFrames come from df, unpersisting df after
> doing the selects shouldn't ignore the cache call, right?

--
This message was sent by Atlassian Jira
(v8.3.2#803003)