Hi SK,

For the problem with lots of shuffle files and the "too many open files" exception, there are a couple of options:
1. The Linux kernel limits how many files each process can have open at
once. Check the current value with ulimit -n, and raise it permanently by
adding nofile entries to /etc/security/limits.conf (note that
/etc/sysctl.conf and /etc/sysctl.d/ control the system-wide fs.file-max
limit, not this per-process one). Try increasing it to a large value -- at
the bare minimum the square of your partition count, since the hash-based
shuffle writes one file per map task per reduce partition.

2. Try shuffle consolidation by setting
spark.shuffle.consolidateFiles=true. This writes far fewer files to disk,
so you are much less likely to hit the limit.

3. Try the sort-based shuffle by setting spark.shuffle.manager=sort. You
should probably hold off on this one until
https://issues.apache.org/jira/browse/SPARK-3032 is fixed, hopefully in
1.1.1.

Hope that helps!
Andrew

On Thu, Sep 25, 2014 at 4:20 PM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I am using Spark 1.1.0 on a cluster. My job takes as input 30 files in a
> directory (I am using sc.textFile("dir/*")) to read in the files. I am
> getting the following warning:
>
> WARN TaskSetManager: Lost task 99.0 in stage 1.0 (TID 99,
> mesos12-dev.sccps.net): java.io.FileNotFoundException:
> /tmp/spark-local-20140925215712-0319/12/shuffle_0_99_93138 (Too many open
> files)
>
> Basically, I think a lot of shuffle files are being created.
>
> 1) The tasks eventually fail and the job just hangs (after taking very
> long, more than an hour). If I read these 30 files in a for loop, the
> same job completes in a few minutes. However, I need to specify the file
> names, which is not convenient. I am assuming that sc.textFile("dir/*")
> creates a large RDD for all 30 files. Is there a way to make the
> operation on this large RDD efficient so as to avoid creating too many
> shuffle files?
>
> 2) Also, I am finding that the shuffle files for my other completed jobs
> are not automatically deleted even after days. I thought that sc.stop()
> clears the intermediate files. Is there some way to programmatically
> delete these temp shuffle files upon job completion?
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-files-tp15185.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
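P.S. For reference, a minimal sketch of how the settings above can be
applied. The limit value, username, class name, and jar name are all
illustrative, not prescriptions:

```shell
# Check the per-process open-file limit; the hash-based shuffle can exhaust it.
limit=$(ulimit -n)
echo "current open-file limit: $limit"

# Raise it for the current shell session (65536 is an illustrative value):
#   ulimit -n 65536
#
# To make it permanent, add nofile entries to /etc/security/limits.conf
# (the username "spark" is illustrative):
#   spark  soft  nofile  65536
#   spark  hard  nofile  65536
#
# The shuffle settings can be passed per job without editing
# spark-defaults.conf (class and jar names here are hypothetical):
#   spark-submit \
#     --conf spark.shuffle.consolidateFiles=true \
#     --conf spark.shuffle.manager=sort \
#     --class com.example.MyJob myjob.jar
```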