[ https://issues.apache.org/jira/browse/CASSANDRA-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016393#comment-17016393 ]
Blake Eggleston commented on CASSANDRA-15367:
---------------------------------------------
I've been trying to work out exactly how this deadlock can occur, based on your
description. Could the deadlock be restated like this?
For a given partition key:
* a write is part of an OpGroup before a barrier set on Memtable1 (M1), but
with a replay position after the final replay position set on M1 before it
flushes.
* So it’s forwarded to M2, while still blocking flushes on M1
* M2 has another in-flight write for this partition; it’s contended, so it’s
holding the lock
** It can’t progress because it can’t allocate memory (in part because M1
can’t flush)
** It doesn’t degrade to allocating on heap because its OpOrder isn’t blocking
anything.
* The write stage becomes saturated with deadlocked writes like these, no more
writes
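
As a rough way to check the restatement above, the wait-for chain can be written down and tested for a cycle. This is only a sketch, not Cassandra code: the class, the method names and the string labels are hypothetical, and the allocation rule is the simplified one from the issue description (an operation may exceed the memory limit only if its group is blocking a flush).

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the scenario; none of these names exist in Cassandra.
final class DeadlockSketch {
    enum Decision { ALLOCATE, ALLOCATE_OVER_LIMIT, WAIT_FOR_FLUSH }

    // Simplified allocation rule: when memory is exhausted, only operations
    // whose group is blocking a flush may exceed the limit; others wait.
    static Decision decide(boolean memoryAvailable, boolean groupBlocksFlush) {
        if (memoryAvailable) return Decision.ALLOCATE;
        return groupBlocksFlush ? Decision.ALLOCATE_OVER_LIMIT : Decision.WAIT_FOR_FLUSH;
    }

    // Builds "X waits for Y" edges for one contended partition, with memory exhausted.
    static Map<String, String> waitsFor(boolean c2GroupBlocksFlush) {
        Map<String, String> g = new LinkedHashMap<>();
        g.put("C1 write", "partition mutex in M2"); // C1 write fell through to M2
        g.put("partition mutex in M2", "C2 write"); // C2 write holds the mutex
        if (decide(false, c2GroupBlocksFlush) == Decision.WAIT_FOR_FLUSH) {
            g.put("C2 write", "memory");            // C2 cannot allocate
            g.put("memory", "M1 flush");            // memory is freed only by flushing M1
            g.put("M1 flush", "C1 write");          // M1 flush waits for the C1 group
        }
        return g;
    }

    // Follows the single outgoing edge of each node until it ends or repeats.
    static boolean hasCycle(Map<String, String> g, String start) {
        Set<String> seen = new HashSet<>();
        for (String cur = start; cur != null; cur = g.get(cur)) {
            if (!seen.add(cur)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("C2 must wait for memory, cycle: "
                + hasCycle(waitsFor(false), "C1 write"));
        System.out.println("C2 may exceed the limit, cycle: "
                + hasCycle(waitsFor(true), "C1 write"));
    }
}
```

In this model the cycle disappears as soon as the C2 write is allowed to allocate past the limit, which is the case the second bullet's sub-points say does not happen.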
> Memtable memory allocations may deadlock
> ----------------------------------------
>
> Key: CASSANDRA-15367
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15367
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Commit Log, Local/Memtable
> Reporter: Benedict Elliott Smith
> Assignee: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> * Under heavy contention, we guard modifications to a partition with a mutex,
> for the lifetime of the memtable.
> * Memtables block for the completion of all {{OpOrder.Group}} started before
> their flush began
> * Memtables permit operations from this cohort to fall-through to the
> following Memtable, in order to guarantee a precise commitLogUpperBound
> * Memtable memory limits may be lifted for operations in the first cohort,
> since they block flush (and hence block future memory allocation)
> With very unfortunate scheduling
> * A contended partition may rapidly escalate to a mutex
> * The system may reach memory limits that prevent allocations for the new
> Memtable’s cohort (C2)
> * An operation from C2 may hold the mutex when this occurs
> * Operations from a prior Memtable’s cohort (C1), for a contended partition,
> may fall-through to the next Memtable
> * The operations from C1 may execute after the above is encountered by those
> from C2