Hello all. Does anyone else have any suggestions? Even understanding what this error is from would help a lot.
On Oct 11, 2014 12:56 AM, "Ilya Ganelin" <ilgan...@gmail.com> wrote:
> Hi Akhil - I tried your suggestions and tried varying my partition sizes.
> Reducing the number of partitions led to memory errors (presumably - I saw
> IOExceptions much sooner).
>
> With the settings you provided, the program ran for longer but ultimately
> crashed in the same way. I would like to understand what is going on
> internally that leads to this.
>
> Could this be related to garbage collection?
> On Oct 10, 2014 3:19 AM, "Akhil Das" <ak...@sigmoidanalytics.com> wrote:
>
>> You could be hitting this issue
>> <https://issues.apache.org/jira/browse/SPARK-3633> (or similar). You can
>> try the following workarounds:
>>
>> sc.set("spark.core.connection.ack.wait.timeout","600")
>> sc.set("spark.akka.frameSize","50")
>>
>> Also reduce the number of partitions; you could be hitting the kernel's
>> ulimit. I faced this issue and it went away when I dropped the partitions
>> from 1600 to 200.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>>
>>> Hi all – I could use some help figuring out a couple of exceptions I've
>>> been getting regularly.
>>>
>>> I have been running on a fairly large dataset (150 GB). With smaller
>>> datasets I don't have any issues.
>>>
>>> My sequence of operations is as follows – unless otherwise specified, I
>>> am not caching:
>>>
>>> Map a 30 million row x 70 column string table to approximately 30
>>> million rows x 5 string columns (for the textFile read I am using 1500
>>> partitions).
>>>
>>> From that, map to ((a, b), score) and reduceByKey, numPartitions = 180.
>>>
>>> Then, extract the distinct values for A and the distinct values for B
>>> (I cache the output of distinct), numPartitions = 180.
>>>
>>> Zip with index for A and for B (to remap strings to ints).
>>>
>>> Join the remapped ids with the original table.
>>>
>>> This is then fed into MLlib's ALS algorithm.
>>>
>>> I am running with:
>>>
>>> Spark version 1.0.2 with CDH 5.1
>>>
>>> numExecutors = 8, numCores = 14
>>>
>>> Memory = 12g
>>>
>>> memoryFraction = 0.7
>>>
>>> Kryo serialization
>>>
>>> My issue is that the code runs fine for a while but then
>>> non-deterministically crashes with either file IOExceptions or the
>>> following obscure error:
>>>
>>> 14/10/08 13:29:59 INFO TaskSetManager: Loss was due to
>>> java.io.IOException: Filesystem closed [duplicate 10]
>>>
>>> 14/10/08 13:30:08 WARN TaskSetManager: Loss was due to
>>> java.io.FileNotFoundException
>>>
>>> java.io.FileNotFoundException:
>>> /opt/cloudera/hadoop/1/yarn/nm/usercache/zjb238/appcache/application_1412717093951_0024/spark-local-20141008131827-c082/1c/shuffle_3_117_354
>>> (No such file or directory)
>>>
>>> Looking through the logs, I see the IOException in other places, but it
>>> appears to be non-catastrophic. The FileNotFoundException, however, is. I
>>> have found the following Stack Overflow question that at least seems to
>>> address the IOException:
>>>
>>> http://stackoverflow.com/questions/24038908/spark-fails-on-big-shuffle-jobs-with-java-io-ioexception-filesystem-closed
>>>
>>> But I have not found anything useful at all with regard to the app
>>> cache error.
>>>
>>> Any help would be much appreciated.
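A minimal sketch, assuming a Scala driver on Spark 1.x, of one way the settings suggested above are usually applied. SparkContext itself has no set method; the values go on a SparkConf before the context is created. The app name and the Kryo line are illustrative additions based on the configuration described in the original mail, not part of Akhil's suggestion.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: the values (600s ack timeout, 50 MB Akka frame size) are
    // the ones suggested above; tune them for your own cluster.
    val conf = new SparkConf()
      .setAppName("als-example") // hypothetical app name
      .set("spark.core.connection.ack.wait.timeout", "600")
      .set("spark.akka.frameSize", "50")
      // Kryo serialization, as mentioned in the original configuration
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)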
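For reference, a hypothetical sketch of the pipeline described in the original mail, assuming Scala and Spark 1.x MLlib. The input path, delimiter, column positions, and the ALS parameters (rank, iterations, lambda) are made up for illustration; only the partition counts (1500 on read, 180 for the shuffles) come from the mail.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical layout: the real table has ~70 string columns.
    val raw  = sc.textFile("hdfs:///path/to/table", 1500)        // 1500 partitions on read
    val rows = raw.map(_.split(",")).map(r => (r(0), r(1), r(2).toDouble))

    // ((a, b), score), reduced to one score per pair, 180 partitions
    val scores = rows.map { case (a, b, s) => ((a, b), s) }
                     .reduceByKey(_ + _, 180)

    // Distinct ids for each side (cached), then zipWithIndex to remap strings to ints
    val aIndex = scores.keys.map(_._1).distinct(180).cache().zipWithIndex()  // (a, Long)
    val bIndex = scores.keys.map(_._2).distinct(180).cache().zipWithIndex()  // (b, Long)

    // Join the integer ids back onto the original pairs and feed MLlib ALS
    val ratings = scores.map { case ((a, b), s) => (a, (b, s)) }
      .join(aIndex)
      .map { case (_, ((b, s), ai)) => (b, (ai, s)) }
      .join(bIndex)
      .map { case (_, ((ai, s), bi)) => Rating(ai.toInt, bi.toInt, s) }

    // Rank, iterations, and lambda are placeholder values
    val model = ALS.train(ratings, 10, 10, 0.01)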