[ https://issues.apache.org/jira/browse/SPARK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zongheng Yang updated SPARK-2865:
---------------------------------
Description:
In the application I tested, most of the 128 tasks finished, but, pretty
deterministically, either 1 or 3 tasks would hang forever (> 5 hrs with no
progress at all) with the stack trace below. The UI showed no failures, and the
nodes running the stuck tasks showed no apparent memory, CPU, or disk pressure.
{noformat}
"Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac
waiting on condition [0x00007f33f4428000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e0d7198e8> (a
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at
org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
at
org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
at
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
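The trace shows the worker thread parked in {{Await.result}} inside
{{ConnectionManager.sendMessageReliablySync}} while fetching a remote block,
i.e. blocked on an ack future that never completes. As a rough illustration
only (this is not the actual Spark code path; all names below are made up),
the sketch shows that wait pattern, and how a bounded wait would surface a
timeout instead of parking for hours:
{noformat}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import java.util.concurrent.TimeoutException

object AckWaitSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the ack future a synchronous send waits on.
    // If the remote side never completes it (e.g. the response is lost),
    // the waiter stays parked in Await.result, matching the
    // WAITING (parking) state in the dump above.
    val ack = Promise[Array[Byte]]()

    // Shape of the hang: an unbounded blocking wait.
    //   Await.result(ack.future, Duration.Inf)   // parks this thread forever

    // A bounded wait fails fast instead, so the remote fetch could be retried
    // or the task failed rather than hanging for hours.
    try {
      Await.result(ack.future, 3.seconds)
      println("ack received")
    } catch {
      case _: TimeoutException =>
        println("no ack within 3s; caller could retry the fetch or fail the task")
    }
  }
}
{noformat}
Of course a bounded wait alone would only turn the hang into a fetch failure;
whatever is preventing the ack from ever arriving still needs tracking down.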
This behavior does *not* appear on 1.0 (reusing the same cluster), but does
appear on the master branch as of Aug 4, 2014 *and* on 1.0.1. I also tried
[this patch|https://github.com/apache/spark/pull/1758], which did not fix the
behavior.
While the tasks were hung, the driver repeatedly printed the following line:
{noformat}
14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
{noformat}
was:
In the application I tested, most of the tasks out of 128 tasks could finish,
but sometimes (pretty deterministically) either 1 or 3 tasks would just hang
forever with the following stack trace. There were no apparent failures from
the UI, also the nodes where the stuck tasks were running had no apparent
memory/CPU/disk pressures.
{noformat}
"Executor task launch worker-0" daemon prio=10 tid=0x00007f32ec003800 nid=0xaac
waiting on condition [0x00007f33f4428000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e0d7198e8> (a
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at
org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
at
org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
at
org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
at
org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}
This behavior does *not* appear on 1.0 (reusing the same cluster), but appears
on the master branch as of Aug 4, 2014 *and* 1.0.1. Further, I tried out [this
patch|https://github.com/apache/spark/pull/1758], and it didn't fix the
behavior.
When this behavior happened, the driver printed out the following line
repeatedly:
{noformat}
14/08/04 23:32:42 WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, ip-172-31-6-74.us-west-1.compute.internal, 59408, 0) with no recent heart beats: 67331ms exceeds 45000ms
{noformat}
> Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-2865
> URL: https://issues.apache.org/jira/browse/SPARK-2865
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 1.0.1, 1.1.0
> Environment: 16-node EC2 r3.2xlarge cluster
> Reporter: Zongheng Yang
> Priority: Blocker
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]