[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830770#comment-16830770 ]

Xingbo Jiang commented on SPARK-18406:
--------------------------------------

This problem still exists in PythonRunner: the Python side uses a pre-fetch 
model to consume the upstream data and opens another thread to serve output 
data to downstream operators, so it is possible for the task to finish first 
and trigger the task cleanup logic, after which the CompletionIterator tries 
to release the read lock it holds on some blocks and finds that the lock has 
already been released. I'll submit a PR to bypass the issue later.

> Race between end-of-task and completion iterator read lock release
> ------------------------------------------------------------------
>
>                 Key: SPARK-18406
>                 URL: https://issues.apache.org/jira/browse/SPARK-18406
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Josh Rosen
>            Assignee: Xingbo Jiang
>            Priority: Major
>             Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The following log comes from a production streaming job where executors 
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 7921)
> java.lang.AssertionError: assertion failed
>       at scala.Predef$.assert(Predef.scala:165)
>       at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>       at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>       at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>       at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>       at scala.Option.foreach(Option.scala:236)
>       at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>       at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>       at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>       at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught exception in thread stdout writer for /databricks/python/bin/python
> java.lang.AssertionError: assertion failed: Block rdd_2741_1 is not locked for reading
>       at scala.Predef$.assert(Predef.scala:179)
>       at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:294)
>       at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:630)
>       at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:434)
>       at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
>       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
>       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>       at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
>       at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
>       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1882)
>       at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
> 16/11/07 17:11:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /databricks/python/bin/python,5,main]
> java.lang.AssertionError: assertion failed: Block rdd_2741_1 is not locked for reading
>       at scala.Predef$.assert(Predef.scala:179)
>       at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:294)
>       at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:630)
>       at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:434)
>       at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
>       at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
>       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>       at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:504)
>       at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:328)
>       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1882)
>       at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala
> {code}
> I think that there's an internal race condition between a task finishing 
> (TID 7921) and automatically releasing its locks, and the "automatically 
> release locks on hitting the end of an iterator" logic running in a separate 
> thread. The log above came from a production streaming job where executors 
> periodically died with this type of error.
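
One generic way to tolerate this kind of race (sketched below purely as an 
illustration; it is not the fix that was actually committed for this ticket) 
is to make the release idempotent, so that whichever of the end-of-task 
cleanup and the end-of-iterator callback runs first performs the release and 
the later one becomes a no-op:

{code}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper, not Spark API: wraps a release action so it runs at
// most once even if both the task-cleanup path and the iterator-completion
// path invoke it.
final class IdempotentRelease(doRelease: () => Unit) {
  private val released = new AtomicBoolean(false)
  def apply(): Unit = if (released.compareAndSet(false, true)) doRelease()
}

object IdempotentReleaseDemo {
  def main(args: Array[String]): Unit = {
    var releases = 0
    val release = new IdempotentRelease(() => releases += 1)
    release() // e.g. end-of-task cleanup in the task runner thread
    release() // e.g. completion callback in the stdout writer thread
    assert(releases == 1, "the lock must be released exactly once")
  }
}
{code}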


