I certainly can, but the problem I’m facing is how best to track all the DataFrames I no longer want to persist.
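To make the desired API concrete, here is roughly what I’d like to be able to write (just a sketch; unpersist_all_except is a name I made up, not an existing Spark API, and as the transcript quoted below shows, this doesn’t actually work today because the RDD IDs don’t line up):

    def unpersist_all_except(spark, dfs_to_keep):
        # IDs of the RDDs backing the DataFrames we still need.
        keep_ids = {df.rdd.id() for df in dfs_to_keep}
        # In PySpark, getPersistentRDDs() is only reachable via the Java context.
        for rdd_id, java_rdd in spark.sparkContext._jsc.getPersistentRDDs().items():
            if rdd_id not in keep_ids:
                java_rdd.unpersist()

    # At the point in the pipeline where only `result` is still needed:
    # unpersist_all_except(spark, [result])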
I create and persist various DataFrames throughout my pipeline. Spark is already tracking all this for me, and exposing some of that tracking information via getPersistentRDDs(). So when I arrive at a point in my program where I know, “I only need this DataFrame going forward”, I want to be able to tell Spark “Please unpersist everything except this one DataFrame”.

If I cannot leverage the information about persisted DataFrames that Spark is already tracking, then the alternative is for me to carefully track and unpersist DataFrames when I no longer need them.

I suppose the problem is similar at a high level to garbage collection. Tracking and freeing DataFrames manually is analogous to malloc and free; and full automation would be Spark automatically unpersisting DataFrames when they were no longer referenced or needed. I’m looking for an in-between solution that lets me leverage some of the persistence tracking in Spark so I don’t have to do it all myself.

Does this make more sense now, from a use case perspective as well as from a desired API perspective?

On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote:

> Why do you need the underlying RDDs? Can't you just unpersist the
> dataframes that you don't need?
>
> On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This seems to be an underexposed part of the API. My use case is this: I
>> want to unpersist all DataFrames except a specific few. I want to do this
>> because I know at a specific point in my pipeline that I have a handful of
>> DataFrames that I need, and everything else is no longer needed.
>>
>> The problem is that there doesn’t appear to be a way to identify specific
>> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
>> which is the only way I’m aware of to ask Spark for all currently
>> persisted RDDs:
>>
>> >>> a = spark.range(10).persist()
>> >>> a.rdd.id()
>> 8
>> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
>> [(3, JavaObject id=o36)]
>>
>> As you can see, the id of the persisted RDD, 8, doesn’t match the id
>> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
>> returned by getPersistentRDDs() and know which ones I want to keep.
>>
>> id() itself appears to be an undocumented method of the RDD API, and in
>> PySpark getPersistentRDDs() is buried behind the Java sub-objects
>> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
>> reaching here. But is there a way to do what I want in PySpark without
>> manually tracking everything I’ve persisted myself?
>>
>> And more broadly speaking, do we want to add additional APIs, or
>> formalize currently undocumented APIs like id(), to make this use case
>> possible?
>>
>> Nick