[ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176902#comment-16176902
 ] 

Imran Rashid edited comment on SPARK-22083 at 9/22/17 6:35 PM:
---------------------------------------------------------------

After another look at this, I'm actually not sure why we didn't see problems 
when we were only dropping one block.  Though there is a {{finally}} block in 
{{DiskStore.put()}}, it only calls {{DiskStore.remove()}}, not 
{{BlockInfoManager.removeBlock()}} / {{unlock()}}, so the lock should still be 
held.  I guess it's just luck that an executor task thread wasn't stuck trying 
to acquire the lock.

I feel like the lock management needs another review.  There seems to be an 
implicit assumption that block management is always done by task threads, but 
it's also done by netty threads as blocks get promoted to or evicted from memory.
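A minimal sketch of the leaked-lock pattern described above (hypothetical types; this is not Spark's actual {{DiskStore}} / {{BlockInfoManager}} code): the {{finally}} cleans up the on-disk data, but nothing releases the write lock, so any other thread that later wants the block waits forever.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Hypothetical stand-in for the per-block metadata a lock manager would track.
class BlockInfo {
  val lock = new ReentrantReadWriteLock()
}

// Sketch of a put() whose failure path removes the data but leaks the lock.
def put(info: BlockInfo, writeData: () => Unit): Unit = {
  info.lock.writeLock().lock()
  var succeeded = false
  try {
    writeData()
    succeeded = true
    // On success the caller is assumed to unlock later, as Spark does elsewhere.
  } finally {
    if (!succeeded) {
      // Analogous to DiskStore.remove(): delete the partially written file...
      // ...but nothing here calls unlock(), so the write lock stays held.
    }
  }
}
```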


> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-22083
>                 URL: https://issues.apache.org/jira/browse/SPARK-22083
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>    Affects Versions: 2.1.1, 2.2.0
>            Reporter: Imran Rashid
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.
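A hedged sketch of the fix the summary asks for (hypothetical lock manager; not Spark's actual {{MemoryStore}} / {{BlockInfoManager}} API): acquire write locks on all eviction candidates up front, then use a {{finally}} to release the locks of any blocks that were never successfully dropped.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a lock manager's bookkeeping of held write locks.
class BlockLockManager {
  private val locked = mutable.Set[String]()
  def lockForWriting(blockId: String): Unit = locked += blockId
  def unlock(blockId: String): Unit = locked -= blockId
  def heldLocks: Set[String] = locked.toSet
}

// Evict candidates; on failure, release every lock that is still held.
def evictBlocksToFreeSpace(manager: BlockLockManager,
                           candidates: Seq[String],
                           dropBlock: String => Unit): Unit = {
  candidates.foreach(manager.lockForWriting)
  val dropped = mutable.Set[String]()
  try {
    candidates.foreach { id =>
      dropBlock(id)        // may throw, e.g. a serialization error (SPARK-21928)
      manager.unlock(id)   // dropping succeeded; lock released normally
      dropped += id
    }
  } finally {
    // Release locks on blocks that were never dropped, so no task thread
    // waits forever on a lock leaked by a failed eviction.
    candidates.filterNot(dropped).foreach(manager.unlock)
  }
}
```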



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
