I meant spark.default.parallelism of course.

On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber <thomas.ger...@radius.com>
wrote:

> Follow up:
> We retried, this time after *decreasing* spark.parallelism. It was set
> to 16000 before (5 times the number of cores in our cluster). It is now
> down to 6400 (2 times the number of cores).
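For reference, a quick check of the arithmetic behind those numbers (this assumes c3.8xlarge instances at 32 vCPUs each, which matches the 100-worker cluster described below; the variable names are just illustrative):

```python
# Back-of-envelope check of the parallelism values mentioned above.
# Assumption: c3.8xlarge = 32 vCPUs per worker (not stated explicitly in the thread).
workers = 100
cores_per_worker = 32
total_cores = workers * cores_per_worker   # 3200 cores in the cluster

old_parallelism = 5 * total_cores  # 16000 -- the setting that failed
new_parallelism = 2 * total_cores  # 6400  -- the setting that got past the failure
print(total_cores, old_parallelism, new_parallelism)
```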
>
> And it got past the point where it failed before.
>
> Does the MapOutputTracker have a limit on the number of tasks it can track?
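One plausible suspect, and this is an assumption on my part rather than anything confirmed in this thread: the driver's MapOutputTracker reply carries status information for every (map task, reduce task) pair, so its size grows with the square of the partition count, and at 16000 partitions it could blow past the 50 MB spark.akka.frameSize below while 6400 stays under it. A rough sketch, assuming roughly one byte per pair as a worst-case figure (not Spark's exact encoding):

```python
# Rough, worst-case estimate of the MapOutputTracker reply size,
# assuming ~1 byte per (map task, reduce task) pair. This is an
# illustrative approximation, not Spark's actual serialization format.
FRAME_SIZE_MB = 50  # spark.akka.frameSize from the settings below

def status_payload_mb(num_maps, num_reduces):
    """Estimated payload in MB for num_maps map tasks x num_reduces reducers."""
    return num_maps * num_reduces / (1024 * 1024)

print(status_payload_mb(16000, 16000))  # ~244 MB, well over the 50 MB frame size
print(status_payload_mb(6400, 6400))    # ~39 MB, under it
```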
>
>
> On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber <thomas.ger...@radius.com>
> wrote:
>
>> Hello,
>>
>> We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge
>> workers). We use spark-submit to start an application.
>>
>> We got the following error which leads to a failed stage:
>>
>> Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, 
>> most recent failure: Lost task 3095.3 in stage 140.0 (TID 308697, 
>> ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException: Error 
>> communicating with MapOutputTracker
>>
>>
>> We tried the whole application again, and it failed on the same stage
>> (but it got more tasks completed on that stage) with the same error.
>>
>> We then looked at the executors' stderr; all show similar logs on both
>> runs (see below). As far as we can tell, the executors and the master have
>> disk space left.
>>
>> *Any suggestion on where to look to understand why the communication with
>> the MapOutputTracker fails?*
>>
>> Thanks
>> Thomas
>> ====
>> In case it matters, our akka settings:
>> spark.akka.frameSize 50
>> spark.akka.threads 8
>> // those below are 10* the default, to cope with large GCs
>> spark.akka.timeout 1000
>> spark.akka.heartbeat.pauses 60000
>> spark.akka.failure-detector.threshold 3000.0
>> spark.akka.heartbeat.interval 10000
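For completeness, the same settings could be passed on the spark-submit command line instead of spark-defaults.conf (a sketch; the application class and jar names are placeholders, not from this thread):

```shell
# Sketch: passing the Akka settings above via spark-submit.
# com.example.MyApp and myapp.jar are hypothetical placeholders.
spark-submit \
  --conf spark.akka.frameSize=50 \
  --conf spark.akka.threads=8 \
  --conf spark.akka.timeout=1000 \
  --conf spark.akka.heartbeat.pauses=60000 \
  --conf spark.akka.failure-detector.threshold=3000.0 \
  --conf spark.akka.heartbeat.interval=10000 \
  --class com.example.MyApp myapp.jar
```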
>>
>> Appendix: executor logs, where it starts going awry
>>
>> 15/03/04 11:45:00 INFO CoarseGrainedExecutorBackend: Got assigned task 298525
>> 15/03/04 11:45:00 INFO Executor: Running task 3083.0 in stage 140.0 (TID 
>> 298525)
>> 15/03/04 11:45:00 INFO MemoryStore: ensureFreeSpace(1473) called with 
>> curMem=5543008799, maxMem=18127202549
>> 15/03/04 11:45:00 INFO MemoryStore: Block broadcast_339_piece0 stored as 
>> bytes in memory (estimated size 1473.0 B, free 11.7 GB)
>> 15/03/04 11:45:00 INFO BlockManagerMaster: Updated info of block 
>> broadcast_339_piece0
>> 15/03/04 11:45:00 INFO TorrentBroadcast: Reading broadcast variable 339 took 
>> 224 ms
>> 15/03/04 11:45:00 INFO MemoryStore: ensureFreeSpace(2536) called with 
>> curMem=5543010272, maxMem=18127202549
>> 15/03/04 11:45:00 INFO MemoryStore: Block broadcast_339 stored as values in 
>> memory (estimated size 2.5 KB, free 11.7 GB)
>> 15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs for 
>> shuffle 18, fetching them
>> 15/03/04 11:45:00 INFO MapOutputTrackerWorker: Doing the fetch; tracker 
>> actor = 
>> Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:52380/user/MapOutputTracker#-2057016370]
>> [the "Don't have map outputs for shuffle 18, fetching them" line above
>> repeats ~30 more times within the same second, 11:45:00]
>> 15/03/04 11:45:30 ERROR MapOutputTrackerWorker: Error communicating with 
>> MapOutputTracker
>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>      at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>      at 
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>      at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>      at 
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>      at scala.concurrent.Await$.result(package.scala:107)
>>      at 
>> org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:112)
>>      at 
>> org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:163)
>>      at 
>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
>>      at 
>> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
>>      at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
>>      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>      at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>      at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>>      at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>>      at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>      at 
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>      at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>>      at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>      at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>      at java.lang.Thread.run(Thread.java:745)
>> 15/03/04 11:45:30 INFO MapOutputTrackerWorker: Doing the fetch; tracker 
>> actor = 
>> Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:52380/user/MapOutputTracker#-2057016370]
>> 15/03/04 11:45:30 ERROR Executor: Exception in task 32.0 in stage 140.0 (TID 
>> 295474)
>> org.apache.spark.SparkException: Error communicating with MapOutputTracker
>>      at 
>> org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:116)
>>      at 
>> org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:163)
>>      at 
>> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
>>
>> ===
>> and then, later, many errors like this one:
>> ===
>>
>> 15/03/04 11:51:50 ERROR TransportRequestHandler: Error sending result 
>> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=29906093434, 
>> chunkIndex=25}, 
>> buffer=FileSegmentManagedBuffer{file=/mnt/spark/spark-3f8c4cbe-a1f8-4a66-ac17-0a3d3daaffaf/spark-92cb6108-35af-4ad0-82f6-ac904b677eff/spark-8fc6043c-df95-4c48-9215-5b9907014b55/spark-99219c49-778b-4b5f-8454-24d2d3b82b81/0d/shuffle_18_6718_0.data,
>>  offset=182070, length=166}} to /10.0.12.24:33174; closing connection
>> java.nio.channels.ClosedChannelException
>>
>>
>
