Ok, just realized you don't use the mllib PageRank. You must use checkpointing, as pointed out in the Databricks URL.
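To make that concrete, here is a minimal sketch of a hand-rolled PageRank loop that checkpoints the ranks RDD every few iterations to truncate the lineage. The method and variable names and the interval of 5 are illustrative assumptions, not taken from the original example:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: checkpoint every 5 iterations so old shuffle files are no
// longer reachable through the lineage and can be cleaned up.
def pageRank(links: RDD[(String, Iterable[String])],
             ranks0: RDD[(String, Double)],
             iters: Int): RDD[(String, Double)] = {
  var ranks = ranks0
  for (i <- 1 to iters) {
    val contribs = links.join(ranks).values.flatMap {
      case (urls, rank) => urls.map(url => (url, rank / urls.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    if (i % 5 == 0) {
      ranks.checkpoint() // requires sc.setCheckpointDir(...) beforehand
      ranks.count()      // materialize, otherwise the checkpoint never happens
    }
  }
  ranks
}
```

After a checkpointed RDD is materialized, its lineage is cut at the checkpoint, so the earlier join outputs are no longer part of any live RDD's dependency graph.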
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala

Due to lineage, Spark doesn't erase the shuffle files. When you do:

contrib = link.join(rank)
rank = contrib.map(...)
contrib = link.join(rank)

I think Spark doesn't erase the shuffle files of the first join because they are still part of the lineage of the second contrib through rank.

Have a look at this: https://www.youtube.com/watch?v=1MWxIUoIYFA

2015-09-16 22:16 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:

> Thanks for your response, Alexis.
>
> I have seen this page, but its suggested solutions do not work, and the
> tmp space still grows linearly after unpersisting RDDs and calling
> System.gc() in each iteration.
>
> I think it might be due to one of the following reasons:
>
> 1. System.gc() does not directly invoke the garbage collector; it just
> requests that the JVM run GC, and the JVM usually postpones it until
> memory is almost full. However, since we are running out of hard-disk
> space (not memory), GC does not run, so the finalize() methods of the
> intermediate RDDs are never triggered.
>
> 2. System.gc() is only executed on the driver, but not on the workers
> (is that how it works??!!)
>
> Any suggestions?
>
> Kind regards,
> Ali Hadian
>
> -----Original Message-----
> From: Alexis Gillain <alexis.gill...@googlemail.com>
> To: Ali Hadian <had...@comp.iust.ac.ir>
> Cc: spark users <user@spark.apache.org>
> Date: Wed, 16 Sep 2015 12:05:35 +0800
> Subject: Re: Spark wastes a lot of space (tmp data) for iterative jobs
>
> You can try System.gc(), considering that checkpointing is enabled by
> default in GraphX:
>
> https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
>
> 2015-09-15 22:42 GMT+08:00 Ali Hadian <had...@comp.iust.ac.ir>:
>
>> Hi!
>> We are executing the PageRank example from the Spark Java examples
>> package on a very large input graph.
>> The code is available here
>> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java>
>> (Spark's GitHub repo).
>>
>> During the execution, the framework generates a huge amount of
>> intermediate data per iteration (i.e., the *contribs* RDD). The
>> intermediate data is temporary, but Spark does not clear the
>> intermediate data of previous iterations. That is to say, if we are in
>> the middle of the 20th iteration, all of the temporary data of all
>> previous iterations (iterations 0 to 19) is still kept in the *tmp*
>> directory. As a result, the tmp directory grows linearly.
>>
>> It seems rational to keep the data from only the previous iteration,
>> because if the current iteration fails, the job can be continued using
>> the intermediate data from the previous iteration. Anyway, why does it
>> keep the intermediate data for ALL previous iterations?
>>
>> How can we force Spark to clear this intermediate data *during* the
>> execution of the job?
>>
>> Kind regards,
>> Ali Hadian
>
> --
> Alexis GILLAIN

--
Alexis GILLAIN
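On Ali's second hypothesis: System.gc() called in the driver program does indeed run only in the driver JVM. A commonly used trick, sketched here under the assumption of a live SparkContext `sc` (the partition count of 16 is illustrative), is to run a dummy job so that the GC request executes inside the executor JVMs:

```scala
// Sketch only: run one trivial task per partition; each task calls
// System.gc() in whichever executor JVM it lands on. This is still only a
// request to the JVM, not a guaranteed collection.
sc.parallelize(0 until 16, 16)
  .foreachPartition(_ => System.gc())
```

Note that tasks are scheduled on executors, not assigned one per executor, so this asks for GC wherever the tasks happen to run; it does not guarantee every executor is hit.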