[ 
https://issues.apache.org/jira/browse/CASSANDRA-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016393#comment-17016393
 ] 

Blake Eggleston commented on CASSANDRA-15367:
---------------------------------------------

I've been trying to work out exactly how this deadlock can occur, based on your 
description. Could the deadlock be restated like this?

 For a given partition key:
 * A write is part of an OpGroup started before a barrier set on Memtable1 (M1),
but has a replay position after the final replay position set on M1 before it
flushes.
 * So it’s forwarded to M2, while still blocking the flush of M1.
 * M2 has another in-flight write for this partition. The partition is
contended, so that write is holding the lock.
 ** It can’t progress because it can’t allocate memory (in part because M1
can’t flush).
 ** It doesn’t degrade to allocating on heap, because its OpOrder isn’t
blocking anything.
 * The write stage becomes saturated with deadlocked writes like these, so no
more writes can proceed.
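
To check that the steps above really form a cycle, here is a minimal sketch that models them as a wait-for graph. It is a toy model only: the class, the method names and the three actor labels ("forwarded write", "lock holder", "M1 flush") are hypothetical and are not Cassandra APIs. It assumes each actor waits on exactly one other actor.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy wait-for graph. All names are hypothetical and not Cassandra APIs. */
public final class WaitForGraphSketch {
    // Simplification: each actor waits for at most one other actor.
    private final Map<String, String> waitsFor = new LinkedHashMap<>();

    public WaitForGraphSketch waits(String waiter, String blocker) {
        waitsFor.put(waiter, blocker);
        return this;
    }

    /** Follows wait-for edges from start; returns the cycle reached, or an empty list. */
    public List<String> cycleFrom(String start) {
        List<String> path = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null && seen.add(current)) {
            path.add(current);
            current = waitsFor.get(current);
        }
        if (current == null) {
            return new ArrayList<>(); // the chain ended, so nothing is deadlocked
        }
        // current was already visited: the cycle begins at its first occurrence
        return new ArrayList<>(path.subList(path.indexOf(current), path.size()));
    }

    /** The restated scenario, as three wait-for edges. */
    public static WaitForGraphSketch restatedScenario() {
        return new WaitForGraphSketch()
            // the write forwarded to M2 needs the partition lock
            .waits("forwarded write", "lock holder")
            // the lock holder needs memory, which only a flush of M1 would free
            .waits("lock holder", "M1 flush")
            // M1 cannot flush until the forwarded write's OpGroup completes
            .waits("M1 flush", "forwarded write");
    }

    public static void main(String[] args) {
        System.out.println(restatedScenario().cycleFrom("forwarded write"));
    }
}
```

If the restatement is right, all three actors end up on the cycle, so none of them can make progress without outside intervention. Removing any one edge (for example, letting the lock holder allocate on heap) breaks the cycle in this model.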

> Memtable memory allocations may deadlock
> ----------------------------------------
>
>                 Key: CASSANDRA-15367
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15367
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Commit Log, Local/Memtable
>            Reporter: Benedict Elliott Smith
>            Assignee: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> * Under heavy contention, we guard modifications to a partition with a mutex, 
> for the lifetime of the memtable.
> * Memtables block for the completion of all {{OpOrder.Group}} started before 
> their flush began
> * Memtables permit operations from this cohort to fall through to the 
> following Memtable, in order to guarantee a precise commitLogUpperBound
> * Memtable memory limits may be lifted for operations in the first cohort, 
> since they block flush (and hence block future memory allocation)
>
> With very unfortunate scheduling:
> * A contended partition may rapidly escalate to a mutex
> * The system may reach memory limits that prevent allocations for the new 
> Memtable’s cohort (C2) 
> * An operation from C2 may hold the mutex when this occurs
> * Operations from a prior Memtable’s cohort (C1), for a contended partition, 
> may fall through to the next Memtable
> * The operations from C1 may execute after the above is encountered by those 
> from C2
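
The allocation rule in the description (memory limits may be lifted for operations that block a flush) can be sketched as a small decision function. This is a toy model under stated assumptions, not the actual allocator: the class, enum and parameter names are hypothetical, and the real code path may differ.

```java
/** Toy model of the allocation rule. All names are hypothetical, not Cassandra APIs. */
public final class AllocationRuleSketch {
    public enum Decision { WITHIN_LIMIT, LIMIT_LIFTED, WAIT_FOR_FLUSH }

    /**
     * @param requestedBytes size of the allocation
     * @param freeBytes      bytes still available under the memtable memory limit
     * @param blocksFlush    true if the operation belongs to a cohort that a flush is waiting on
     */
    public static Decision decide(long requestedBytes, long freeBytes, boolean blocksFlush) {
        if (requestedBytes <= freeBytes) {
            return Decision.WITHIN_LIMIT;
        }
        // Over the limit: only operations that block a flush may proceed anyway,
        // since making them wait would also stall the flush that frees memory.
        return blocksFlush ? Decision.LIMIT_LIFTED : Decision.WAIT_FOR_FLUSH;
    }

    public static void main(String[] args) {
        System.out.println(decide(64, 0, true));   // an operation from the flushing cohort
        System.out.println(decide(64, 0, false));  // an operation from the next cohort
    }
}
```

In the scenario described, an operation from C2 would fall into the waiting branch while holding the mutex, and an operation from C1 would be allowed past the limit but never reaches the allocation because it is waiting for that mutex.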



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
