This seems to be an underexposed part of the API. My use case is this: I
want to unpersist all DataFrames except a specific few. I want to do this
because I know at a specific point in my pipeline that I have a handful of
DataFrames that I need, and everything else is no longer needed.

The problem is that there doesn’t appear to be a way to identify specific
DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
which is the only way I’m aware of to ask Spark for all currently persisted
RDDs:

>>> a = spark.range(10).persist()
>>> a.rdd.id()
8
>>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
[(3, JavaObject id=o36)]

As you can see, the id reported by the persisted DataFrame’s underlying
RDD, 8, doesn’t match the id returned by getPersistentRDDs(), 3. So I can’t
go through the RDDs returned by getPersistentRDDs() and tell which ones
belong to the DataFrames I want to keep.
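
To make the gap concrete, here is a sketch of the approach I was hoping
would work: collect the ids of the DataFrames I want to keep and unpersist
everything else returned by getPersistentRDDs(). keep_dfs is just a
placeholder name for that handful of DataFrames. Because of the id mismatch
above, the keep set never matches anything, so this would unpersist the
DataFrames I want to keep as well:

# keep_dfs is a placeholder for the handful of DataFrames I still need.
keep_ids = {df.rdd.id() for df in keep_dfs}  # e.g. {8}
for rdd_id, java_rdd in spark.sparkContext._jsc.getPersistentRDDs().items():
    if rdd_id not in keep_ids:  # keys here are e.g. 3, so nothing matches
        java_rdd.unpersist()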

id() itself appears to be an undocumented method of the RDD API, and in
PySpark getPersistentRDDs() is buried behind the Java sub-objects
<https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m reaching
here. But is there a way to do what I want in PySpark without manually
tracking everything I’ve persisted myself?
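
For reference, the kind of manual tracking I’d like to avoid looks roughly
like this (persisted_dfs, persist_tracked, and unpersist_all_except are
just names I made up for this sketch):

persisted_dfs = []

def persist_tracked(df):
    # Register every DataFrame as I persist it.
    df.persist()
    persisted_dfs.append(df)
    return df

def unpersist_all_except(keep_dfs):
    # Unpersist everything I've tracked except the DataFrames I still need.
    keep = {id(df) for df in keep_dfs}  # Python object identity, not RDD ids
    for df in persisted_dfs:
        if id(df) not in keep:
            df.unpersist()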

And more broadly speaking, do we want to add additional APIs, or formalize
currently undocumented APIs like id(), to make this use case possible?

Nick