I certainly can, but the problem I’m facing is how best to track all the
DataFrames I no longer want to persist.

I create and persist various DataFrames throughout my pipeline. Spark is
already tracking all this for me, and exposing some of that tracking
information via getPersistentRDDs(). So when I arrive at a point in my
program where I know, “I only need this DataFrame going forward”, I want to
be able to tell Spark “Please unpersist everything except this one
DataFrame”. If I cannot leverage the information about persisted DataFrames
that Spark is already tracking, then the alternative is for me to carefully
track and unpersist DataFrames when I no longer need them.
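
To make that alternative concrete, here is a rough sketch of the kind of
bookkeeping I would have to maintain by hand (persist_tracked and
unpersist_all_except are names I just made up for illustration, not real
APIs):

    # Hand-maintained registry of everything I've persisted.
    persisted = {}  # label -> DataFrame

    def persist_tracked(label, df):
        persisted[label] = df.persist()
        return persisted[label]

    def unpersist_all_except(*keep_labels):
        # Unpersist and forget everything I haven't explicitly asked to keep.
        for label, df in list(persisted.items()):
            if label not in keep_labels:
                df.unpersist()
                del persisted[label]

It works, but it duplicates tracking that Spark is already doing for me
internally.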

I suppose the problem is similar at a high level to garbage collection.
Tracking and freeing DataFrames manually is analogous to malloc and free,
while full automation would be Spark automatically unpersisting DataFrames
once they are no longer referenced or needed. I’m looking for an
in-between solution that lets me leverage some of the persistence tracking
in Spark so I don’t have to do it all myself.
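
For example, if the underlying ids lined up, I imagine I could write
something along these lines against the tracking Spark already exposes.
This is purely a hypothetical sketch: as my earlier example below shows,
df.rdd.id() does not match the keys returned by getPersistentRDDs(), and
I’m assuming (not asserting) that unpersist() can be called on the returned
Java objects this way.

    # Purely hypothetical: "unpersist everything except these DataFrames".
    def unpersist_all_except(spark, keep_dfs):
        # For this to work, df.rdd.id() would need to match the keys
        # returned by getPersistentRDDs() -- today it doesn't.
        keep_ids = {df.rdd.id() for df in keep_dfs}
        persistent_rdds = spark.sparkContext._jsc.getPersistentRDDs()
        for rdd_id, jrdd in persistent_rdds.items():
            if rdd_id not in keep_ids:
                # Assumption on my part: the Java RDD objects accept
                # unpersist() when called through py4j like this.
                jrdd.unpersist()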

Does this make more sense now, from a use case perspective as well as from
a desired API perspective?

On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote:

> Why do you need the underlying RDDs? Can't you just unpersist the
> dataframes that you don't need?
>
>
> On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This seems to be an underexposed part of the API. My use case is this: I
>> want to unpersist all DataFrames except a specific few. I want to do this
>> because I know at a specific point in my pipeline that I have a handful of
>> DataFrames that I need, and everything else is no longer needed.
>>
>> The problem is that there doesn’t appear to be a way to identify specific
>> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
>> which is the only way I’m aware of to ask Spark for all currently persisted
>> RDDs:
>>
>> >>> a = spark.range(10).persist()
>> >>> a.rdd.id()
>> 8
>> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
>> [(3, JavaObject id=o36)]
>>
>> As you can see, the id reported by a.rdd.id(), 8, doesn’t match the id
>> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
>> returned by getPersistentRDDs() and know which ones I want to keep.
>>
>> id() itself appears to be an undocumented method of the RDD API, and in
>> PySpark getPersistentRDDs() is buried behind the Java sub-objects
>> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
>> reaching here. But is there a way to do what I want in PySpark without
>> manually tracking everything I’ve persisted myself?
>>
>> And more broadly speaking, do we want to add additional APIs, or
>> formalize currently undocumented APIs like id(), to make this use case
>> possible?
>>
>> Nick
>>
>
