Hi Burak,

Most discussions of checkpointing in the docs are related to Spark Streaming. Are you talking about sparkContext.setCheckpointDir()? What effect does that have?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

On Wed, Sep 17, 2014 at 7:44 AM, Burak Yavuz <bya...@stanford.edu> wrote:
> Hi,
>
> The files you mentioned are temporary files written by Spark during
> shuffling. ALS will write a LOT of those files, as it is a shuffle-heavy
> algorithm.
> Those files will be deleted after your program completes, as Spark looks
> for those files in case a fault occurs. Having those files ready allows
> Spark to continue from the stage the shuffle left off, instead of starting
> from the very beginning.
>
> Long story short, it's to your benefit that Spark writes those files to
> disk. If you don't want Spark writing to disk, you can specify a checkpoint
> directory in HDFS, where Spark will write the current status instead and
> will clean up files from disk.
>
> Best,
> Burak
>
> ----- Original Message -----
> From: "Макар Красноперов" <connector....@gmail.com>
> To: user@spark.apache.org
> Sent: Wednesday, September 17, 2014 7:37:49 AM
> Subject: Spark and disk usage.
>
> Hello everyone.
>
> The problem is that Spark writes to disk very heavily, even when the
> application has a lot of free memory (about 3.8g).
> I've noticed that a folder with a name like
> "spark-local-20140917165839-f58c" contains a lot of other folders with
> files like "shuffle_446_0_1". The total size of the files in the dir
> "spark-local-20140917165839-f58c" can reach 1.1g.
> Sometimes its size decreases (are there only temp files in that folder?),
> so the total amount of data written to disk is greater than 1.1g.
>
> The question is: what kind of data does Spark store there, and can I make
> Spark keep it in memory instead of writing it to disk when there is
> enough free RAM?
>
> I run my job locally with Spark 1.0.1:
> ./bin/spark-submit --driver-memory 12g --master local[3] \
>   --properties-file conf/spark-defaults.conf \
>   --class my.company.Main /path/to/jar/myJob.jar
>
> spark-defaults.conf:
> spark.shuffle.spill false
> spark.reducer.maxMbInFlight 1024
> spark.shuffle.file.buffer.kb 2048
> spark.storage.memoryFraction 0.7
>
> This disk usage pattern is common to many of my jobs. I also used ALS
> from MLlib and saw similar behavior.
>
> I have had no success playing with the Spark configuration, and I hope
> someone can help me :)
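
For anyone who wants to check how much of a spark-local directory is actually shuffle output, here is a minimal sketch in plain Python (standard library only, no Spark API involved). The function name shuffle_bytes is made up for illustration, and the file-name pattern is an assumption based on the single example "shuffle_446_0_1" mentioned in the thread, so it may need adjusting for other Spark versions.

```python
import os
import re

# Assumption: shuffle files are named like "shuffle_446_0_1"
# (the only example given in the thread). Prefix match only,
# so any suffix after the three numbers is also counted.
SHUFFLE_FILE = re.compile(r"^shuffle_\d+_\d+_\d+")


def shuffle_bytes(local_dir):
    """Return the total size in bytes of shuffle files under local_dir.

    Walks all subfolders, since the thread describes the files as
    sitting in nested folders. A missing directory yields 0.
    """
    total = 0
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            if not SHUFFLE_FILE.match(name):
                continue
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been removed between listing and stat
                # (the thread notes the directory size sometimes shrinks).
                pass
    return total
```

Calling shuffle_bytes() on the path of the "spark-local-20140917165839-f58c" folder a few times while the job runs should show whether the growth comes from shuffle files or from something else in that directory.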