Hi Everyone! I'm trying to understand how Spark's cache work.
Here is my naive understanding, please let me know if I'm missing something: val rdd1 = sc.textFile("some data") rdd.cache() //marks rdd as cached val rdd2 = rdd1.filter(...) val rdd3 = rdd1.map(...) rdd2.saveAsTextFile("...") rdd3.saveAsTextFile("...") In the above, rdd1 will be loaded from disk (e.g. HDFS) only once. (when rdd2 is saved I assume) and then from cache (assuming there is enough RAM) when rdd3 is saved) Now here is my question. Let's say I want to cache rdd2 and rdd3 as they will both be used later on, but I don't need rdd1 after creating them. Basically there is duplication, isn't it? Since once rdd2 and rdd3 are calculated, I don't need rdd1 anymore, I should probably unpersist it, right? the question is when? *Will this work? (Option A)* val rdd1 = sc.textFile("some data") rdd.cache() //marks rdd as cached val rdd2 = rdd1.filter(...) val rdd3 = rdd1.map(...) rdd2.cache() rdd3.cache() rdd1.unpersist() Does spark add the unpersist call to the DAG? or is it done immediately? if it's done immediately, then basically rdd1 will be non cached when I read from rdd2 and rdd3, right? *Should I do it this way instead (Option B)?* val rdd1 = sc.textFile("some data") rdd.cache() //marks rdd as cached val rdd2 = rdd1.filter(...) val rdd3 = rdd1.map(...) rdd2.cache() rdd3.cache() rdd2.saveAsTextFile("...") rdd3.saveAsTextFile("...") rdd1.unpersist() *So the question is this:* Is Option A good enough? e.g. will rdd1 be still accessing the file only once? Or do I need to go with Option B? (see also http://stackoverflow.com/questions/29903675/understanding-sparks-caching) Thanks in advance