[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354680#comment-16354680 ]

Imran Rashid commented on SPARK-19870:
--------------------------------------

[~eyalfa] my recollection is a bit rusty, but your explanation sounds 
reasonable. I'm pretty sure that if SPARK-22083 is the cause, you'll see some 
other exception on the executor before this (it may even appear to be 
handled gracefully at that point), so you should take a look at the executor logs.  
If there is no exception, it's probably not SPARK-22083. Still, executor logs would 
be really helpful for figuring out what it might be. As you mentioned earlier, if 
this is reproducible, then it's worth even turning on TRACE and just grabbing 
those logs.
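
For reference, one way to grab the executor logs on YARN and bump the relevant loggers to TRACE. The logger names and spark-submit flags below are just a sketch based on a stock Spark 2.x / log4j 1.2 setup (not verified against this particular job), so adjust as needed:

{noformat}
# Pull the aggregated executor logs for the application
# (requires YARN log aggregation to be enabled):
yarn logs -applicationId <appId> > executor-logs.txt

# log4j.properties entries enabling TRACE for the classes that show up
# in the attached thread dumps:
log4j.logger.org.apache.spark.storage.BlockInfoManager=TRACE
log4j.logger.org.apache.spark.storage.BlockManager=TRACE
log4j.logger.org.apache.spark.broadcast.TorrentBroadcast=TRACE

# Ship the custom log4j.properties to the executors:
spark-submit \
  --files log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  ...
{noformat}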

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> ------------------------------------------------------------
>
>                 Key: SPARK-19870
>                 URL: https://issues.apache.org/jira/browse/SPARK-19870
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Shuffle
>    Affects Versions: 2.0.2, 2.1.0
>         Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>            Reporter: Steven Ruppert
>            Priority: Major
>         Attachments: stack.txt
>
>
> I'm running what I believe to be a fairly vanilla Spark job using the RDD API, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to Parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x00007fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x00007fffb95f3000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
>         - waiting to lock <0x00000005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>         - locked <0x00000005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x00007fffd88d0000 nid=0x1022b8 in Object.wait() [0x00007fffb96f4000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
>         - locked <0x0000000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
>         at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
>         - locked <0x00000005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>         - locked <0x000000059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>         at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>         at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> A full stack trace is attached, but those appear to be the offending threads.
> This happens across several different executors, and has persisted through 
> several runs of the same job on both Spark 2.1.0 and 2.0.2. I also tried 
> killing individual executors to "unstick" the job, to no avail.
> I haven't yet narrowed the job down to something publicly reproducible, 
> but hopefully the stack traces are enough to start debugging.


