Thanks for the response, Conor. I tried those settings and for a while it seemed to be cleaning up the shuffle files after itself. However, after exactly 5 hours it started throwing exceptions and eventually stopped working again. A sample stack trace is below. The 5-hour mark is curious because I had set the cleaner ttl to 5 hours after reducing the max window size to 1 hour (down from 6 hours, in order to test). It also stopped cleaning the shuffle files once this started happening.
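For anyone following along, the settings being discussed would be applied roughly like this when building the streaming context (a sketch, not our exact code; the app name and batch interval are illustrative, and spark.cleaner.ttl is given in seconds, so 5 hours = 18000):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative sketch of the configuration under discussion, not our exact code.
val conf = new SparkConf()
  .setAppName("streaming-app")               // illustrative app name
  .set("spark.cleaner.ttl", "18000")         // periodic metadata/shuffle cleanup: 5 hours, in seconds
  .set("spark.streaming.unpersist", "true")  // let streaming unpersist generated RDDs

val ssc = new StreamingContext(conf, Seconds(10))  // illustrative batch interval
```

One possible explanation for what follows: if the ttl does not comfortably exceed the longest window plus any processing delay, the cleaner can delete blocks that are still needed, which would be consistent with the "block input-0-... not found" error below.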
Any idea why this could be happening?

2015-04-22 17:39:52,040 ERROR Executor task launch worker-989 Executor.logError - Exception in task 0.0 in stage 215425.0 (TID 425147)
java.lang.Exception: Could not compute split, block input-0-1429706099000 not found
        at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Thanks
NB

On Tue, Apr 21, 2015 at 5:14 AM, Conor Fennell <conor.fenn...@altocloud.com> wrote:
> Hi,
>
> We set the spark.cleaner.ttl to some reasonable time and also
> set spark.streaming.unpersist=true.
>
> Those together cleaned up the shuffle files for us.
>
> -Conor
>
> On Tue, Apr 21, 2015 at 8:18 AM, N B <nb.nos...@gmail.com> wrote:
>
>> We already do have a cron job in place to clean just the shuffle files.
>> However, what I would really like to know is whether there is a "proper"
>> way of telling Spark to clean up these files once it's done with them?
>>
>> Thanks
>> NB
>>
>> On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>
>>> Write a cron job for this, like below:
>>>
>>> 12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
>>> 32 * * * * find /tmp -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>> 52 * * * * find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name "spark-*-*-*" -prune -exec rm -rf {} \+
>>>
>>> On 20 April 2015 at 23:12, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I had posed this query as part of a different thread but did not get a
>>>> response there, so I am creating a new thread hoping to catch someone's
>>>> attention.
>>>>
>>>> We are experiencing this issue of shuffle files being left behind and
>>>> not being cleaned up by Spark. Since this is a Spark Streaming application,
>>>> it is expected to stay up indefinitely, so shuffle files not being cleaned
>>>> up is a big problem right now. Our max window size is 6 hours, so we have
>>>> set up a cron job to clean up shuffle files older than 12 hours; otherwise
>>>> they will eat up all our disk space.
>>>>
>>>> Please see the following. It seems the non-cleaning of shuffle files is
>>>> being documented in 1.3.1:
>>>>
>>>> https://github.com/apache/spark/pull/5074/files
>>>> https://issues.apache.org/jira/browse/SPARK-5836
>>>>
>>>> Also, for some reason, the following JIRAs that were reported as
>>>> functional issues were closed as duplicates of the above documentation bug.
>>>> Does this mean that this issue won't be tackled at all?
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-3563
>>>> https://issues.apache.org/jira/browse/SPARK-4796
>>>> https://issues.apache.org/jira/browse/SPARK-6011
>>>>
>>>> Any further insight into whether this is being looked into, and
>>>> meanwhile how to handle shuffle files, will be greatly appreciated.
>>>>
>>>> Thanks
>>>> NB
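For anyone who finds this thread later: the cron entries quoted above all reduce to the same find-and-delete pattern. Here is a self-contained sketch of that pattern run against a throwaway scratch directory instead of the real Spark directories (note it uses -mmin/modification time rather than the -cmin/change time in the cron lines, only because mtime can be backdated with touch for demonstration purposes; directory names are illustrative):

```shell
# Demo of the cleanup pattern from the cron entries, against a scratch dir.
SCRATCH=$(mktemp -d)
mkdir -p "$SCRATCH/spark-old-1-2" "$SCRATCH/spark-new-3-4"

# Backdate one directory so it appears older than 1440 minutes (24 hours).
touch -d '2 days ago' "$SCRATCH/spark-old-1-2"

# Remove only top-level spark-* directories older than 24 hours.
find "$SCRATCH" -mindepth 1 -maxdepth 1 -type d -mmin +1440 \
     -name "spark-*-*-*" -prune -exec rm -rf {} +

ls "$SCRATCH"   # only spark-new-3-4 should remain
```

The -mindepth/-maxdepth bounds keep find from descending into the directories it is about to delete, and -prune plus the batched -exec ... {} + keeps it to a single rm invocation.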