As I understand it, rdd.cache() really means rdd.cache_this_rdd_when_it_actually_materializes(), so a somewhat esoteric problem can occur.
The example is as follows:

    void method1() {
        JavaRDD<...> rdd = sc.textFile(...)
                             .map(...);

        rdd.cache();
        // Since the following methods may call action methods multiple times,
        // cache the rdd to prevent rebuilding.

        method2(rdd); // may or may not call action methods on rdd
        method3(rdd); // may or may not call action methods on rdd

        // #HERE#: the action methods may or may not have been called.

        rdd.saveAsTextFile(...);
        // If none of the methods above called an action method,
        // rdd is materialized and cached here.
        // But we don't need the cache anymore, so caching was unnecessary.

        rdd.unpersist();
    }

If there were an rdd.cancelCache() method that we could call at #HERE#, the unnecessary caching could be avoided. cancelCache() would cancel the pending caching request if caching has not happened yet. It differs from unpersist(), which undoes caching that has actually been done.

Is rdd.cancelCache() really needed, or am I misunderstanding the caching mechanism?
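
To make the distinction concrete, here is a toy model of the behavior described above. It is only a sketch of my understanding: ToyRdd, computeCount, and cancelCache() are hypothetical names invented for illustration, and the class does not reflect how Spark implements caching internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: ToyRdd is a hypothetical stand-in, not Spark's RDD class.
public class LazyCacheSketch {

    static class ToyRdd<T> {
        private final Supplier<List<T>> compute; // recomputes the data from scratch
        private boolean cacheRequested = false;  // set by cache(); nothing is stored yet
        private List<T> cached = null;           // filled in by the first action
        int computeCount = 0;                    // how many times the data was recomputed

        ToyRdd(Supplier<List<T>> compute) {
            this.compute = compute;
        }

        // Only records the request; no data is computed or stored here.
        ToyRdd<T> cache() {
            cacheRequested = true;
            return this;
        }

        // Hypothetical cancelCache(): withdraws a request that has not been fulfilled yet.
        void cancelCache() {
            if (cached == null) {
                cacheRequested = false;
            }
        }

        // Drops data that was actually cached, together with the request.
        void unpersist() {
            cached = null;
            cacheRequested = false;
        }

        boolean isMaterialized() {
            return cached != null;
        }

        // An action: this is the point where caching really happens.
        List<T> collect() {
            if (cached != null) {
                return cached;
            }
            computeCount++;
            List<T> result = compute.get();
            if (cacheRequested) {
                cached = result;
            }
            return result;
        }
    }

    public static void main(String[] args) {
        ToyRdd<Integer> rdd = new ToyRdd<>(() -> Arrays.asList(1, 2, 3));
        rdd.cache();
        System.out.println(rdd.isMaterialized()); // false: cache() only recorded a request

        rdd.cancelCache();                        // the #HERE# step from the example
        rdd.collect();                            // the final action
        System.out.println(rdd.isMaterialized()); // false: nothing was cached needlessly
    }
}
```

In this model, an action that runs after cache() stores the data, and only unpersist() removes it again, whereas cancelCache() has an effect only while the request is still pending.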