Jose Silva created SPARK-29035:
----------------------------------

             Summary: unpersist() ignoring cache/persist()
                 Key: SPARK-29035
                 URL: https://issues.apache.org/jira/browse/SPARK-29035
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.4.3
         Environment: Amazon EMR - Spark 2.4.3
            Reporter: Jose Silva

Calling unpersist() on a cached DataFrame, even though that DataFrame is not used directly anymore, removes all the InMemoryTableScan nodes from the DAG.

Here's a simplified version of the code I'm using:

df = spark.read(...).where(...).cache()
df_a = union(df.select(...), df.select(...), df.select(...))
df_b = df.select(...)
df_c = df.select(...)
df_d = df.select(...)
df.unpersist()
join(df_a, df_b, df_c, df_d).write()

I've created an [album|https://imgur.com/a/c1xGq0r] with the two DAGs, with and without the unpersist() call.

I call unpersist() in order to prevent an OOM during the join. From what I understand, even though all the DataFrames are derived from df, unpersisting df after doing the selects shouldn't cause the cache() call to be ignored, right?
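
A possible explanation, offered as a sketch and not as a confirmed diagnosis: as far as I understand, DataFrames are lazy, and the step that swaps a cached plan for an InMemoryTableScan runs when an action executes, not when select() is called. If that holds, unpersisting before write() leaves nothing in the cache to match. The toy model below illustrates only that ordering effect in plain Python. CacheRegistry and LazyFrame are hypothetical names invented for this illustration and are not Spark classes or Spark's implementation.

```python
class CacheRegistry:
    """Toy stand-in for a session-wide cache registry (hypothetical, not Spark's API)."""

    def __init__(self):
        self._cached = set()

    def add(self, plan):
        self._cached.add(plan)

    def remove(self, plan):
        self._cached.discard(plan)

    def contains(self, plan):
        return plan in self._cached


class LazyFrame:
    """Toy lazy frame: a plan is just a tuple of step names, source first."""

    def __init__(self, registry, plan):
        self.registry = registry
        self.plan = plan

    def cache(self):
        # Only registers the plan; nothing is computed here.
        self.registry.add(self.plan)
        return self

    def unpersist(self):
        self.registry.remove(self.plan)
        return self

    def select(self, name):
        # Builds a longer plan; no cache lookup happens at this point.
        return LazyFrame(self.registry, self.plan + (f"select({name})",))

    def execute(self):
        # The cache lookup happens here, at action time:
        # replace the longest cached prefix of the plan with a scan node.
        for end in range(len(self.plan), 0, -1):
            if self.registry.contains(self.plan[:end]):
                return ("InMemoryTableScan",) + self.plan[end:]
        return self.plan


def unpersist_before_action():
    registry = CacheRegistry()
    df = LazyFrame(registry, ("read", "where")).cache()
    df_b = df.select("b")
    df.unpersist()            # cache entry removed before anything ran
    return df_b.execute()


def unpersist_after_action():
    registry = CacheRegistry()
    df = LazyFrame(registry, ("read", "where")).cache()
    df_b = df.select("b")
    result = df_b.execute()   # action runs while the entry still exists
    df.unpersist()
    return result
```

In this model, unpersist_before_action() returns the full uncached plan, while unpersist_after_action() returns a plan that starts with the scan node. If Spark behaves analogously, moving df.unpersist() to after write() in the code above would keep the InMemoryTableScan nodes, at the cost of holding the cached data in memory during the join.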