Hi Burak,

Most of the discussion of checkpointing in the docs relates to Spark
Streaming.  Are you talking about sparkContext.setCheckpointDir()?
What effect does that have?

https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
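
For reference, the usage I have in mind is roughly this minimal sketch
(untested on my side; the checkpoint path is made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits in 1.0.x

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-test"))

    // Checkpoint data is written under this directory (made-up path).
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val counts = sc.parallelize(1 to 1000000)
      .map(x => (x % 100, 1))
      .reduceByKey(_ + _)

    // Marks the RDD for checkpointing; it is saved under the checkpoint
    // dir, and its lineage truncated, the next time a job runs on it.
    counts.checkpoint()
    counts.count()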

On Wed, Sep 17, 2014 at 7:44 AM, Burak Yavuz <bya...@stanford.edu> wrote:

> Hi,
>
> The files you mentioned are temporary files written by Spark during
> shuffling. ALS will write a LOT of those files, as it is a shuffle-heavy
> algorithm.
> Those files are deleted once your program completes; Spark keeps them
> around while it runs in case a fault occurs. Having those files ready
> allows Spark to continue from the stage where the shuffle left off,
> instead of starting from the very beginning.
>
> Long story short, it's to your benefit that Spark writes those files to
> disk. If you don't want Spark writing to local disk, you can specify a
> checkpoint directory in HDFS, where Spark will write the current state
> instead, and it will clean up the files from local disk.
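>
> For example, something along these lines (an untested sketch; the HDFS
> paths are placeholders, and how much of this ALS in 1.0.1 actually takes
> advantage of is something you'd have to verify):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.mllib.recommendation.{ALS, Rating}
>
>     val sc = new SparkContext(new SparkConf().setAppName("als-checkpoint"))
>
>     // Checkpoint state is written under this directory (placeholder path).
>     sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
>
>     val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
>       val Array(user, product, rating) = line.split(',')
>       Rating(user.toInt, product.toInt, rating.toDouble)
>     }
>
>     // rank = 10, iterations = 10, lambda = 0.01 -- example values only
>     val model = ALS.train(ratings, 10, 10, 0.01)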
>
> Best,
> Burak
>
> ----- Original Message -----
> From: "Макар Красноперов" <connector....@gmail.com>
> To: user@spark.apache.org
> Sent: Wednesday, September 17, 2014 7:37:49 AM
> Subject: Spark and disk usage.
>
> Hello everyone.
>
> The problem is that Spark writes data to disk very heavily, even when the
> application has a lot of free memory (about 3.8 GB).
> I've noticed that a folder with a name like
> "spark-local-20140917165839-f58c" contains a lot of other folders with
> files like "shuffle_446_0_1". The total size of the files in
> "spark-local-20140917165839-f58c" can reach 1.1 GB.
> Sometimes its size decreases (are there only temp files in that folder?),
> so the total amount of data written to disk is greater than 1.1 GB.
>
> The question is: what kind of data does Spark store there, and can I make
> Spark keep it in memory instead of writing it to disk when there is enough
> free RAM?
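>
> To illustrate what I mean, I can keep an RDD itself in memory with
> persist(), roughly like this (an untested sketch; as far as I can tell it
> does not affect the shuffle files):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.SparkContext._  // pair-RDD implicits in 1.0.x
>     import org.apache.spark.storage.StorageLevel
>
>     val sc = new SparkContext(new SparkConf().setAppName("mem-only"))
>     val pairs = sc.parallelize(1 to 1000000).map(x => (x % 100, x))
>
>     // MEMORY_ONLY keeps the RDD's partitions in RAM (recomputing them if
>     // evicted), but the reduceByKey shuffle still writes files to disk.
>     val sums = pairs.reduceByKey(_ + _).persist(StorageLevel.MEMORY_ONLY)
>     sums.count()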
>
> I run my job locally with Spark 1.0.1:
> ./bin/spark-submit --driver-memory 12g --master local[3] --properties-file
> conf/spark-defaults.conf --class my.company.Main /path/to/jar/myJob.jar
>
> spark-defaults.conf:
> # Do not spill shuffle data to disk during reduces (uses more memory).
> spark.shuffle.spill             false
> # Max size (MB) of map outputs to fetch simultaneously per reduce task.
> spark.reducer.maxMbInFlight     1024
> # In-memory buffer size (KB) for each shuffle file output stream.
> spark.shuffle.file.buffer.kb    2048
> # Fraction of the heap reserved for Spark's storage of cached RDDs.
> spark.storage.memoryFraction    0.7
>
> This disk usage pattern is common across many jobs. I have also used ALS
> from MLlib and saw similar behavior.
>
> I have had no success playing with the Spark configuration, and I hope
> someone can help me :)
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
