[
https://issues.apache.org/jira/browse/CASSANDRA-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023358#comment-17023358
]
Blake Eggleston commented on CASSANDRA-15367:
---------------------------------------------
Yep, I think that would fix the problem. Another approach, which wouldn’t have
the potential to introduce delays, would be to skip locking if we have set (or
are about to set) the final replay position on a memtable that is waiting on an
op group. It’s like setting blocking, except that it won’t bypass the allocator,
which matters if the flush queue is long. That would fix the deadlock without
delaying later writes, although it could increase contention.
Rough example with lazy naming
[here|https://github.com/bdeggleston/cassandra/tree/15367-alternative-2].
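To illustrate the lock-skipping idea without reading the branch, here is a
minimal Java sketch. It is not the code on the linked branch: MemtableState,
OpGroup, finalPositionSet, startedBeforeBarrier and update are hypothetical
stand-ins, not Cassandra's actual classes, and the lock-free path is only
represented by a return value.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-ins for illustration; not Cassandra's real types.
final class LockSkipSketch {
    /** Stand-in for a memtable that is being switched out for flush. */
    static final class MemtableState {
        // Assumed flag: true once the final replay position has been
        // (or is about to be) set on this memtable.
        final AtomicBoolean finalPositionSet = new AtomicBoolean(false);
    }

    /** Stand-in for an op group a write belongs to. */
    static final class OpGroup {
        // Assumed flag: true if the flushing memtable is waiting on this group.
        final boolean startedBeforeBarrier;

        OpGroup(boolean startedBeforeBarrier) {
            this.startedBeforeBarrier = startedBeforeBarrier;
        }
    }

    /** True when the write should not take the partition lock. */
    static boolean shouldSkipLock(MemtableState flushing, OpGroup group) {
        return flushing != null
            && flushing.finalPositionSet.get()
            && group.startedBeforeBarrier;
    }

    /** Returns which path the write took: "lock-free" or "locked". */
    static String update(ReentrantLock partitionLock, MemtableState flushing, OpGroup group) {
        if (shouldSkipLock(flushing, group)) {
            // The flush is waiting on this write, so never park it behind the
            // mutex. The cost is possible extra contention on the lock-free path.
            return "lock-free";
        }
        partitionLock.lock();
        try {
            return "locked";
        } finally {
            partitionLock.unlock();
        }
    }
}
```

Unlike marking the group as blocking, nothing here touches the allocator, so
memory limits still apply to the write.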
It would be nice if a write waiting for a lock could unblock itself as soon as
its op group becomes blocking.
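One way a waiting write might unblock itself is to poll rather than park on
the lock indefinitely. The sketch below assumes the write can observe a flag
that flips when its op group becomes blocking; groupIsBlocking,
lockUnlessBlocking and the polling interval are hypothetical, and a real
implementation could use a signal instead of polling.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper for illustration; not part of Cassandra.
final class SelfUnblockingLockSketch {
    /**
     * Tries to acquire the lock, but gives up as soon as the caller's op
     * group is marked blocking.
     *
     * @return true if the caller now holds the lock (and must unlock it),
     *         false if it should proceed without the lock
     */
    static boolean lockUnlessBlocking(ReentrantLock lock,
                                      AtomicBoolean groupIsBlocking,
                                      long pollNanos) {
        while (!groupIsBlocking.get()) {
            if (lock.tryLock()) {
                return true;
            }
            // Back off briefly, then re-check the blocking flag.
            LockSupport.parkNanos(pollNanos);
        }
        return false;
    }
}
```

A caller would take the locked path when this returns true and fall back to
the unlocked path otherwise.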
Random thoughts about longer-term fixes:
I didn’t have a chance to get my head around how you’d intended to remove the
lock completely, but I don’t understand how that could be done without
reintroducing the contention GC problem.
It seems to me that the root cause of all this is that we have 2 mechanisms for
ordering events (OpOrder and ReplayPosition) which are mostly independent, but
have to interact in non-deterministic ways during memtable flush, which creates
these edge cases. I think the right fix (or one of them) is to either merge
these two classes, or make one control the other.
> Memtable memory allocations may deadlock
> ----------------------------------------
>
> Key: CASSANDRA-15367
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15367
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Commit Log, Local/Memtable
> Reporter: Benedict Elliott Smith
> Assignee: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> * Under heavy contention, we guard modifications to a partition with a mutex,
> for the lifetime of the memtable.
> * Memtables block for the completion of all {{OpOrder.Group}} started before
> their flush began
> * Memtables permit operations from this cohort to fall-through to the
> following Memtable, in order to guarantee a precise commitLogUpperBound
> * Memtable memory limits may be lifted for operations in the first cohort,
> since they block flush (and hence block future memory allocation)
> With very unfortunate scheduling:
> * A contended partition may rapidly escalate to a mutex
> * The system may reach memory limits that prevent allocations for the new
> Memtable’s cohort (C2)
> * An operation from C2 may hold the mutex when this occurs
> * Operations from a prior Memtable’s cohort (C1), for a contended partition,
> may fall-through to the next Memtable
> * The operations from C1 may execute after the above is encountered by those
> from C2
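The scheduling described in the quoted steps can be read as a wait-for cycle:
the flush waits on the C1 write, the C1 write waits on the mutex held by the
C2 write, and the C2 write waits on memory that only the flush will free. The
Java sketch below is a toy model of that reading, not Cassandra code; the node
labels and the WaitForCycleSketch class are illustrative assumptions.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy wait-for graph; each participant waits on at most one other.
final class WaitForCycleSketch {
    /** Builds the wait-for edges for the scenario in the issue description. */
    static Map<String, String> scenario() {
        Map<String, String> waitsFor = new LinkedHashMap<>();
        // The flush blocks until every C1 operation has completed.
        waitsFor.put("flush", "C1 write");
        // The C1 write fell through to the next memtable and needs the
        // partition mutex, which the C2 write currently holds.
        waitsFor.put("C1 write", "C2 write");
        // The C2 write needs memory that is only released by the flush.
        waitsFor.put("C2 write", "flush");
        return waitsFor;
    }

    /** Follows the wait-for chain from start; true if a node is revisited. */
    static boolean hasCycleFrom(Map<String, String> waitsFor, String start) {
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null) {
            if (!seen.add(current)) {
                return true;
            }
            current = waitsFor.get(current);
        }
        return false;
    }
}
```

Removing the "C1 write" edge, which is what letting the C1 write skip the
mutex amounts to in this model, breaks the cycle.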