Hong Shen created SPARK-4909:
--------------------------------
Summary: "Error communicating with MapOutputTracker" when running a
big Spark job
Key: SPARK-4909
URL: https://issues.apache.org/jira/browse/SPARK-4909
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.1.0
Reporter: Hong Shen
When I run a Spark job with 38788 map tasks and 997 reduce tasks, the job fails.
Here is the log:
14/12/20 15:11:18 ERROR spark.MapOutputTrackerWorker: Error communicating with MapOutputTracker
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:109)
        at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:162)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:43)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:41)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:117)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:293)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:260)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:293)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:260)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:114)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:293)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:260)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:293)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:260)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:293)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:260)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:293)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:260)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
The serialized map output status is more than 15 MB, and more than 500 executors
ask the driver for the map output locations of the shuffle. Because the driver
sends the full map output locations to every executor individually, the requests
queue up at the driver and the executors time out. Maybe we can optimize this so
that the driver does not send the map output locations to every executor itself,
for example by using a broadcast variable.
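A rough sketch of the broadcast idea, from the driver's point of view: serialize
the map statuses once and distribute them through a broadcast variable, so
executors pull the data peer-to-peer instead of each asking the driver actor.
This is only an illustration of the proposal, not actual MapOutputTracker code;
the variable names (sc, mapStatuses, bcast) are hypothetical, and I assume the
existing MapOutputTracker.serializeMapStatuses / deserializeMapStatuses helpers
can be reused.

```scala
// Driver side: serialize the per-shuffle map statuses once and broadcast them.
// sc is the SparkContext; mapStatuses holds the MapStatus array for one shuffle.
val serialized: Array[Byte] = MapOutputTracker.serializeMapStatuses(mapStatuses)
val bcast = sc.broadcast(serialized)

// Executor side: instead of sending GetMapOutputStatuses to the driver actor
// and blocking on the reply (the Await.result that times out above), read the
// statuses from the broadcast, which is fetched from peers in chunks.
val statuses = MapOutputTracker.deserializeMapStatuses(bcast.value)
```

With a broadcast, the 15 MB payload is transferred between executors as well,
so the driver no longer has to push 500+ copies itself within the 120-second
ask timeout.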
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]