Ok, just realized you don't use the mllib PageRank. You must use checkpointing, as pointed out in the Databricks URL.
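To make that concrete, here is a minimal sketch of a hand-rolled PageRank loop that checkpoints the ranks RDD every few iterations to truncate the lineage. The method and variable names and the interval of 5 are illustrative assumptions, not taken from the original example:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: checkpoint every 5 iterations so old shuffle files are no
// longer reachable through the lineage and can be cleaned up.
def pageRank(links: RDD[(String, Iterable[String])],
             ranks0: RDD[(String, Double)],
             iters: Int): RDD[(String, Double)] = {
  var ranks = ranks0
  for (i <- 1 to iters) {
    val contribs = links.join(ranks).values.flatMap {
      case (urls, rank) => urls.map(url => (url, rank / urls.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    if (i % 5 == 0) {
      ranks.checkpoint() // requires sc.setCheckpointDir(...) beforehand
      ranks.count()      // materialize, otherwise the checkpoint never happens
    }
  }
  ranks
}
```

After a checkpointed RDD is materialized, its lineage is cut at the checkpoint, so the earlier join outputs are no longer part of any live RDD's dependency graph.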
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala

Due to lineage, Spark doesn't erase the shuffle files. When you do:

contrib = link.join(rank)
rank = contrib.map(...)
contrib = link.join(rank)

I think Spark doesn't erase the shuffle files of the first join because they are still part of the lineage of the second contrib through rank.

Have a look at this: https://www.youtube.com/watch?v=1MWxIUoIYFA

2015-09-16 22:16 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:

> Thanks for your response, Alexis.
>
> I have seen this page, but its suggested solutions do not work, and the
> tmp space still grows linearly after unpersisting RDDs and calling
> System.gc() in each iteration.
>
> I think it might be due to one of the following reasons:
>
> 1. System.gc() does not directly invoke the garbage collector; it just
> requests that the JVM run GC, and the JVM usually postpones it until
> memory is almost full. However, since we are running out of hard-disk
> space (not memory), GC does not run, so the finalize() methods of the
> intermediate RDDs are never triggered.
>
> 2. System.gc() is only executed on the driver, but not on the workers
> (is that how it works??!!)
>
> Any suggestions?
>
> Kind regards,
> Ali Hadian
>
> -----Original Message-----
> From: Alexis Gillain <alexis.gill...@googlemail.com>
> To: Ali Hadian <had...@comp.iust.ac.ir>
> Cc: spark users <user@spark.apache.org>
> Date: Wed, 16 Sep 2015 12:05:35 +0800
> Subject: Re: Spark wastes a lot of space (tmp data) for iterative jobs
>
> You can try System.gc(), considering that checkpointing is enabled by
> default in GraphX:
>
> https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
>
> 2015-09-15 22:42 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:
>
>> Hi!
>> We are executing the PageRank example from the Spark Java examples
>> package on a very large input graph.
>> The code is available here
>> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java>
>> (Spark's GitHub repo).
>>
>> During the execution, the framework generates a huge amount of
>> intermediate data per iteration (i.e., the *contribs* RDD). The
>> intermediate data is temporary, but Spark does not clear the
>> intermediate data of previous iterations. That is to say, if we are in
>> the middle of the 20th iteration, all of the temporary data of all
>> previous iterations (iterations 0 to 19) is still kept in the *tmp*
>> directory. As a result, the tmp directory grows linearly.
>>
>> It seems rational to keep the data from only the previous iteration,
>> because if the current iteration fails, the job can be continued using
>> the intermediate data from the previous iteration. Anyway, why does it
>> keep the intermediate data for ALL previous iterations?
>>
>> How can we force Spark to clear this intermediate data *during* the
>> execution of the job?
>>
>> Kind regards,
>> Ali Hadian
>
> --
> Alexis GILLAIN

--
Alexis GILLAIN
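On Ali's second hypothesis: System.gc() called in the driver program does indeed run only in the driver JVM. A commonly used trick, sketched here under the assumption of a live SparkContext `sc` (the partition count of 16 is illustrative), is to run a dummy job so that the GC request executes inside the executor JVMs:

```scala
// Sketch only: run one trivial task per partition; each task calls
// System.gc() in whichever executor JVM it lands on. This is still only a
// request to the JVM, not a guaranteed collection.
sc.parallelize(0 until 16, 16)
  .foreachPartition(_ => System.gc())
```

Note that tasks are scheduled on executors, not assigned one per executor, so this asks for GC wherever the tasks happen to run; it does not guarantee every executor is hit.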