I meant spark.default.parallelism of course.

On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber <thomas.ger...@radius.com> wrote:
> Follow up:
> We retried, this time after *decreasing* spark.parallelism. It was set
> to 16000 before (5 times the number of cores in our cluster). It is now
> down to 6400 (2 times the number of cores).
>
> And it got past the point where it failed before.
>
> Does the MapOutputTracker have a limit on the number of tasks it can track?
>
>
> On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber <thomas.ger...@radius.com> wrote:
>
>> Hello,
>>
>> We are using Spark 1.2.1 on a very large cluster (100 c3.8xlarge
>> workers). We use spark-submit to start an application.
>>
>> We got the following error, which leads to a failed stage:
>>
>> Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times,
>> most recent failure: Lost task 3095.3 in stage 140.0 (TID 308697,
>> ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException: Error
>> communicating with MapOutputTracker
>>
>> We tried the whole application again, and it failed on the same stage
>> (though it got more tasks completed on that stage) with the same error.
>>
>> We then looked at the executors' stderr, and all show similar logs on both
>> runs (see below). As far as we can tell, executors and master have disk
>> space left.
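The figures above can be sanity-checked with some back-of-envelope arithmetic. Two assumptions not stated in the thread: a c3.8xlarge exposes 32 vCPUs, and the MapOutputTracker's reply grows roughly with maps × reduces (Spark's compressed MapStatus stores about one byte per reduce partition). Under those assumptions:

```python
# Back-of-envelope check of the parallelism figures in the thread.
# Assumptions (not from the thread itself): 32 vCPUs per c3.8xlarge,
# and ~1 byte per (map, reduce) pair in the serialized map-output statuses.

workers = 100
cores_per_worker = 32               # c3.8xlarge vCPUs (assumption)
total_cores = workers * cores_per_worker

before = 5 * total_cores            # parallelism before the change
after = 2 * total_cores             # parallelism after the change

# Rough size of the map-output status payload for a maps x reduces shuffle:
payload_before_mb = before * before / 1e6
payload_after_mb = after * after / 1e6

print(total_cores)                  # 3200
print(before, after)                # 16000 6400
print(round(payload_before_mb))     # 256 (MB)
print(round(payload_after_mb))      # 41 (MB)
```

If that estimate is in the right ballpark, the 16000-partition payload would far exceed the spark.akka.frameSize of 50 MB quoted further down in the thread, while the 6400-partition payload would fit just under it, which would be consistent with the tracker communication failing only at the higher parallelism.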
>>
>> *Any suggestion on where to look to understand why the communication with
>> the MapOutputTracker fails?*
>>
>> Thanks,
>> Thomas
>>
>> ====
>> In case it matters, our akka settings:
>> spark.akka.frameSize 50
>> spark.akka.threads 8
>> // those below are 10x the default, to cope with large GCs
>> spark.akka.timeout 1000
>> spark.akka.heartbeat.pauses 60000
>> spark.akka.failure-detector.threshold 3000.0
>> spark.akka.heartbeat.interval 10000
>>
>> Appendix: executor logs, where it starts going awry
>>
>> 15/03/04 11:45:00 INFO CoarseGrainedExecutorBackend: Got assigned task 298525
>> 15/03/04 11:45:00 INFO Executor: Running task 3083.0 in stage 140.0 (TID 298525)
>> 15/03/04 11:45:00 INFO MemoryStore: ensureFreeSpace(1473) called with curMem=5543008799, maxMem=18127202549
>> 15/03/04 11:45:00 INFO MemoryStore: Block broadcast_339_piece0 stored as bytes in memory (estimated size 1473.0 B, free 11.7 GB)
>> 15/03/04 11:45:00 INFO BlockManagerMaster: Updated info of block broadcast_339_piece0
>> 15/03/04 11:45:00 INFO TorrentBroadcast: Reading broadcast variable 339 took 224 ms
>> 15/03/04 11:45:00 INFO MemoryStore: ensureFreeSpace(2536) called with curMem=5543010272, maxMem=18127202549
>> 15/03/04 11:45:00 INFO MemoryStore: Block broadcast_339 stored as values in memory (estimated size 2.5 KB, free 11.7 GB)
>> 15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 18, fetching them
>> 15/03/04 11:45:00 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:52380/user/MapOutputTracker#-2057016370]
>> [the "Don't have map outputs for shuffle 18, fetching them" line repeats roughly 30 more times at 11:45:00]
>> 15/03/04 11:45:30 ERROR MapOutputTrackerWorker: Error communicating with MapOutputTracker
>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>     at scala.concurrent.Await$.result(package.scala:107)
>>     at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:112)
>>     at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:163)
>>     at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
>>     at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
>>     at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> 15/03/04 11:45:30 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:52380/user/MapOutputTracker#-2057016370]
>> 15/03/04 11:45:30 ERROR Executor: Exception in task 32.0 in stage 140.0 (TID 295474)
>> org.apache.spark.SparkException: Error communicating with MapOutputTracker
>>     at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:116)
>>     at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:163)
>>     at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
>>
>> ===
>> and then later a lot of those:
>> ===
>>
>> 15/03/04 11:51:50 ERROR TransportRequestHandler: Error sending result
>> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=29906093434, chunkIndex=25},
>> buffer=FileSegmentManagedBuffer{file=/mnt/spark/spark-3f8c4cbe-a1f8-4a66-ac17-0a3d3daaffaf/spark-92cb6108-35af-4ad0-82f6-ac904b677eff/spark-8fc6043c-df95-4c48-9215-5b9907014b55/spark-99219c49-778b-4b5f-8454-24d2d3b82b81/0d/shuffle_18_6718_0.data,
>> offset=182070, length=166}} to /10.0.12.24:33174; closing connection
>> java.nio.channels.ClosedChannelException
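For what it's worth, the "Futures timed out after [30 seconds]" in the stack trace matches the default of spark.akka.askTimeout (30 seconds in Spark 1.x), which governs the MapOutputTracker ask and is a separate setting from the spark.akka.timeout already raised in the config above. A sketch of what raising it might look like in spark-defaults.conf (the 300 value is illustrative; this would only buy the tracker more time, not shrink its reply):

```
# spark-defaults.conf (sketch; spark.akka.askTimeout defaults to 30 seconds
# in Spark 1.x and is distinct from spark.akka.timeout)
spark.akka.askTimeout   300
```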