Zongheng Yang created SPARK-2865:
------------------------------------
Summary: Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
Key: SPARK-2865
URL: https://issues.apache.org/jira/browse/SPARK-2865
Project: Spark
Issue Type: Bug
Components: Shuffle, Spark Core
Affects Versions: 1.0.1, 1.1.0
Environment: 16-node EC2 r3.2xlarge cluster
Reporter: Zongheng Yang
Priority: Blocker
In the application I tested, most of the 128 tasks finished, but, fairly
reproducibly, either 1 or 3 tasks would hang forever with the following stack
trace. The UI showed no apparent failures, and the nodes running the stuck
tasks showed no apparent memory, CPU, or disk pressure.
{noformat}
"Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac waiting on condition [0x00007f33f4428000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e0d7198e8> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
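For readers following the trace: the task thread is parked inside Await.result, called from ConnectionManager.sendMessageReliablySync, waiting on a promise that apparently never completes. The sketch below is only a rough illustration in plain Java, not Spark code. The class name BoundedAwaitSketch and the method awaitReply are hypothetical, and CompletableFuture stands in for the Scala promise. It shows how bounding such a wait turns a silent hang into a timeout that can be detected. Whether a timeout is the right remedy for this issue is an open question.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in Spark.
final class BoundedAwaitSketch {

    // Waits for a reply for at most timeoutMillis.
    // Returns Optional.empty() if no reply arrived within the bound.
    static Optional<String> awaitReply(CompletableFuture<String> reply, long timeoutMillis) {
        try {
            return Optional.of(reply.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // The reply never came; the caller can now retry or fail the task.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reply", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("reply completed with an error", e);
        }
    }

    public static void main(String[] args) {
        // A future that is never completed, standing in for a lost block-fetch reply.
        // Calling lost.get() with no timeout would park this thread indefinitely.
        CompletableFuture<String> lost = new CompletableFuture<>();
        System.out.println("lost reply present: " + awaitReply(lost, 100).isPresent());

        // A future that is already completed returns its value immediately.
        CompletableFuture<String> ok = CompletableFuture.completedFuture("block-bytes");
        System.out.println("ok reply present: " + awaitReply(ok, 100).isPresent());
    }
}
```

With an unbounded wait, the first case would reproduce the parked state seen in the trace. With the bound, the caller gets an empty result it can act on.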
This behavior does *not* appear on 1.0 (reusing the same cluster), but it does
appear on 1.0.1 *and* on the master branch as of Aug 4, 2014. I also tried
[this patch|https://github.com/apache/spark/pull/1758], and it did not fix the
behavior.
When the hang occurred, the driver repeatedly printed the following line:
{noformat}
14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
{noformat}
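The warning reports 67331 ms since the last heartbeat against a 45000 ms limit. The sketch below shows the kind of check that would produce such a message. It assumes a simple map from block manager ID to last-heartbeat time. The class HeartbeatExpirySketch and the method expired are hypothetical names, and this is not the BlockManagerMasterActor implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Spark's heartbeat code.
final class HeartbeatExpirySketch {

    // Returns the IDs whose last heartbeat is older than timeoutMillis at time nowMillis.
    static List<String> expired(Map<String, Long> lastSeenMillis, long nowMillis, long timeoutMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
            long elapsed = nowMillis - entry.getValue();
            if (elapsed > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // TreeMap keeps the iteration order deterministic for the demo.
        Map<String, Long> lastSeen = new TreeMap<>();
        lastSeen.put("7", 0L);      // silent for 67331 ms at the time of the check
        lastSeen.put("8", 40000L);  // silent for 27331 ms at the time of the check
        System.out.println(expired(lastSeen, 67331L, 45000L));
    }
}
```

A block manager removed this way may still be running. That would be consistent with the report that the affected nodes showed no obvious resource pressure.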