[
https://issues.apache.org/jira/browse/SPARK-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen resolved SPARK-14055.
--------------------------------
Resolution: Fixed
Fix Version/s: 2.0.0
Issue resolved by pull request 11875
[https://github.com/apache/spark/pull/11875]
> AssertionError may happen if writeLock is not released in the 'removeBlock'
> method
> ------------------------------------------------------------------------------------
>
> Key: SPARK-14055
> URL: https://issues.apache.org/jira/browse/SPARK-14055
> Project: Spark
> Issue Type: Bug
> Components: Block Manager, Spark Core
> Affects Versions: 2.0.0
> Environment: Spark 2.0-SNAPSHOT
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 16 cores & 64G RAM / node
> Data Replication factor of 2
> Each node has 1 Spark executor configured with 16 cores and 40GB of RAM.
> Reporter: Ernest
> Assignee: Ernest
> Priority: Critical
> Fix For: 2.0.0
>
>
> We got the following log when running _LiveJournalPageRank_.
> {quote}
> 452823:16/03/21 19:28:47.444 TRACE BlockInfoManager: Task 1662 trying to
> acquire write lock for rdd_3_183
> 452825:16/03/21 19:28:47.445 TRACE BlockInfoManager: Task 1662 acquired write
> lock for rdd_3_183
> 456941:16/03/21 19:28:47.596 INFO BlockManager: Dropping block rdd_3_183 from
> memory
> 456943:16/03/21 19:28:47.597 DEBUG MemoryStore: Block rdd_3_183 of size
> 418784648 dropped from memory (free 3504141600)
> 457027:16/03/21 19:28:47.600 DEBUG BlockManagerMaster: Updated info of block
> rdd_3_183
> 457053:16/03/21 19:28:47.600 DEBUG BlockManager: Told master about block
> rdd_3_183
> 457082:16/03/21 19:28:47.602 TRACE BlockInfoManager: Task 1662 trying to
> remove block rdd_3_183
> 500373:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to put
> rdd_3_183
> 500374:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to
> acquire read lock for rdd_3_183
> 500375:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 trying to
> acquire write lock for rdd_3_183
> 500376:16/03/21 19:28:49.893 TRACE BlockInfoManager: Task 1681 acquired write
> lock for rdd_3_183
> 517257:16/03/21 19:28:56.299 INFO BlockInfoManager: ****** taskAttemptId is:
> 1662, info.writerTask is: 1681, blockID is: rdd_3_183 so AssertionError
> happeneds here*****
> 517258-16/03/21 19:28:56.299 ERROR Executor: Exception in task 177.0 in stage
> 10.0 (TID 1662)
> 517259-java.lang.AssertionError: assertion failed
> 517260- at scala.Predef$.assert(Predef.scala:151)
> 517261- at
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:356)
> 517262- at
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1$$anonfun$apply$1.apply(BlockInfoManager.scala:351)
> 517263- at scala.Option.foreach(Option.scala:257)
> 517264- at
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:351)
> 517265- at
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$1.apply(BlockInfoManager.scala:350)
> 517266- at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
> 517267- at
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:350)
> 517268- at
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:626)
> 517269- at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:238)
> {quote}
> When memory for RDD storage is insufficient and several partitions have to
> be evicted, this _AssertionError_ may occur.
> In the example above, several partitions (including rdd\_3\_183) needed to be
> evicted while _Task 1662_ was running. _Task 1662_ first acquired the read
> and write locks, then called _dropBlock_ from
> _MemoryStore.evictBlocksToFreeSpace_, which actually dropped rdd\_3\_183 from
> memory. Because _newEffectiveStorageLevel.isValid_ was false, execution went
> into _BlockInfoManager.removeBlock_, but _writeLocksByTask_ was not updated
> there, so it still recorded _Task 1662_ as holding a write lock on
> rdd\_3\_183.
> Meanwhile, _Task 1681_ had already started and needed to recompute
> rdd\_3\_183 to produce its target RDD, so it acquired the write lock on
> rdd\_3\_183. When _Task 1662_ finally called _releaseAllLocksForTask_, the
> assertion failed because the block's writer task was 1681, not 1662.
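
The sequence described above can be replayed with a small toy model. This is only a sketch in Python, not Spark's code: the class `ToyBlockInfoManager`, its `clean_up_on_remove` flag, and the `replay` helper are hypothetical names made up for illustration, and the cleanup shown is one plausible way to keep the bookkeeping consistent, not necessarily what pull request 11875 does.

```python
from collections import defaultdict

NO_WRITER = -1  # marker meaning "no task holds the write lock"


class ToyBlockInfoManager:
    """Toy model of per-block write-lock bookkeeping (hypothetical, not Spark's code)."""

    def __init__(self, clean_up_on_remove):
        self.clean_up_on_remove = clean_up_on_remove
        self.writer_task = {}                        # block id -> task holding the write lock
        self.write_locks_by_task = defaultdict(set)  # task -> block ids it write-locked

    def lock_new_block_for_writing(self, task, block_id):
        # Succeeds only when the block is not currently tracked.
        if block_id in self.writer_task:
            return False
        self.writer_task[block_id] = task
        self.write_locks_by_task[task].add(block_id)
        return True

    def remove_block(self, task, block_id):
        if self.writer_task.get(block_id) != task:
            raise RuntimeError("task must hold the write lock to remove a block")
        del self.writer_task[block_id]
        if self.clean_up_on_remove:
            # Forget the lock in the per-task bookkeeping as well.
            self.write_locks_by_task[task].discard(block_id)

    def release_all_locks_for_task(self, task):
        for block_id in self.write_locks_by_task.pop(task, set()):
            owner = self.writer_task.get(block_id, NO_WRITER)
            if owner == NO_WRITER:
                continue  # block is gone and nobody re-created it
            if owner != task:
                raise AssertionError(
                    f"task {task} releasing {block_id}, but writer is {owner}")
            self.writer_task[block_id] = NO_WRITER


def replay(clean_up_on_remove):
    """Replay the interleaving from the log: 1662 removes, 1681 re-creates, 1662 releases."""
    manager = ToyBlockInfoManager(clean_up_on_remove)
    manager.lock_new_block_for_writing(1662, "rdd_3_183")
    manager.remove_block(1662, "rdd_3_183")                # eviction drops the block
    manager.lock_new_block_for_writing(1681, "rdd_3_183")  # another task recomputes it
    try:
        manager.release_all_locks_for_task(1662)           # end of task 1662
        return "ok"
    except AssertionError:
        return "AssertionError"


if __name__ == "__main__":
    print("stale bookkeeping:  ", replay(clean_up_on_remove=False))
    print("cleaned bookkeeping:", replay(clean_up_on_remove=True))
```

With `clean_up_on_remove=False` the model still lists rdd_3_183 under task 1662 after the block is removed, so the final release finds task 1681 as the writer and fails, matching the logged values of taskAttemptId and info.writerTask.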