This seems to be an underexposed part of the API. My use case is this: I want to unpersist all DataFrames except a specific few. At a certain point in my pipeline, I know I have a handful of DataFrames I still need, and everything else in the cache is no longer needed.
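Concretely, the kind of thing I was hoping to write looks roughly like this. It's just a sketch of the intent; important_df1, important_df2, and dfs_to_keep are placeholder names of mine:

# Keep these DataFrames cached; unpersist everything else.
dfs_to_keep = [important_df1, important_df2]
ids_to_keep = {df.rdd.id() for df in dfs_to_keep}

# Walk the currently persisted RDDs and unpersist any whose id
# isn't in my keep set.
for rdd_id, java_rdd in spark.sparkContext._jsc.getPersistentRDDs().items():
    if rdd_id not in ids_to_keep:
        java_rdd.unpersist()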
The problem is that there doesn't appear to be a way to identify specific DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(), which is the only way I'm aware of to ask Spark for all currently persisted RDDs:

>>> a = spark.range(10).persist()
>>> a.rdd.id()
8
>>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
[(3, JavaObject id=o36)]

As you can see, the id of the persisted RDD, 8, doesn't match the id returned by getPersistentRDDs(), 3. So I can't go through the RDDs returned by getPersistentRDDs() and know which ones I want to keep.

id() itself appears to be an undocumented method of the RDD API, and in PySpark getPersistentRDDs() is buried behind the Java sub-objects <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I'm reaching here. But is there a way to do what I want in PySpark without manually tracking everything I've persisted myself?

And more broadly speaking, do we want to add additional APIs, or formalize currently undocumented APIs like id(), to make this use case possible?

Nick
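P.S. In case it helps make the question concrete, the manual bookkeeping I'm falling back to looks roughly like this. It's only a sketch; persisted_dfs, persist_tracked, and unpersist_all_except are my own names, not anything in the Spark API.

# Track every DataFrame I persist myself, since I can't reliably
# match DataFrames up later via getPersistentRDDs().
persisted_dfs = []

def persist_tracked(df):
    df.persist()
    persisted_dfs.append(df)
    return df

def unpersist_all_except(dfs_to_keep):
    still_persisted = []
    for df in persisted_dfs:
        if any(df is keeper for keeper in dfs_to_keep):
            still_persisted.append(df)
        else:
            df.unpersist()
    persisted_dfs[:] = still_persisted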