Hi Everyone!

I'm trying to understand how Spark's caching works.

Here is my naive understanding; please let me know if I'm missing something:

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")

In the above, rdd1 will be loaded from disk (e.g. HDFS) only once (when
rdd2 is saved, I assume) and then read from the cache (assuming there is
enough RAM) when rdd3 is saved.
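
As a sanity check on my understanding, I've been printing the storage
level; my assumption is that getStorageLevel only reflects the cache()
marking, not whether any blocks are actually in memory yet:

val rdd1 = sc.textFile("some data")
println(rdd1.getStorageLevel) // should show the default (NONE) before marking
rdd1.cache()                  // marks rdd1 for caching; nothing computed yet
println(rdd1.getStorageLevel) // should now show MEMORY_ONLY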

Now here is my question. Let's say I want to cache rdd2 and rdd3 as they
will both be used later on, but I don't need rdd1 after creating them.

Basically there is duplication, isn't there? Once rdd2 and rdd3 are
computed, I don't need rdd1 anymore, so I should probably unpersist it,
right? The question is when.

*Will this work? (Option A)*

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)
rdd2.cache()
rdd3.cache()
rdd1.unpersist()

Does Spark add the unpersist call to the DAG, or is it executed
immediately? If it's executed immediately, then rdd1 will effectively not
be cached by the time I read from rdd2 and rdd3, right?
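
One empirical check I thought of (assuming sc.getPersistentRDDs really
does list the RDDs currently marked persistent):

rdd1.cache()
println(sc.getPersistentRDDs.size) // expect 1: marking happens immediately
rdd1.unpersist()
println(sc.getPersistentRDDs.size) // expect 0 if unpersist is also immediate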

*Should I do it this way instead (Option B)?*

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached
val rdd2 = rdd1.filter(...)
val rdd3 = rdd1.map(...)

rdd2.cache()
rdd3.cache()

rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")

rdd1.unpersist()

*So the question is this:* Is Option A good enough? I.e., will rdd1 still
read the file only once? Or do I need to go with Option B?
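
For completeness, here is a third variant I was considering (call it
Option C), where I force materialization with count() before
unpersisting. My assumption is that it's the action that actually
populates the cache, so after the two counts rdd1's blocks should no
longer be needed. (The filter/map lambdas below are just placeholders for
the real logic.)

val rdd1 = sc.textFile("some data")
rdd1.cache() // marks rdd1 as cached

val rdd2 = rdd1.filter(_.nonEmpty)  // placeholder predicate
val rdd3 = rdd1.map(_.toUpperCase)  // placeholder mapping

rdd2.cache()
rdd3.cache()

rdd2.count() // action: materializes rdd2, reads the file, caches rdd1
rdd3.count() // action: materializes rdd3, should hit rdd1's cache

rdd1.unpersist() // rdd2 and rdd3 are cached now, so rdd1 seems safe to drop

rdd2.saveAsTextFile("...")
rdd3.saveAsTextFile("...")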

(see also
http://stackoverflow.com/questions/29903675/understanding-sparks-caching)

Thanks in advance
