[
https://issues.apache.org/jira/browse/CASSANDRA-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022906#comment-17022906
]
Benedict Elliott Smith commented on CASSANDRA-15367:
----------------------------------------------------
So, I decided to start writing a version of your approach with slightly more
explicit control flow. However, I realised that neither that approach nor my
original one fixes this bug.
The issue is that we have all been assuming there is only one table on the
system. In fact, the flushing {{Memtable}} that's waiting for the operation to
complete may be in an altogether different table. It might be that the
operation holding the lock and the operation that needs to obtain the lock are
both members of the same logical cohort for this {{Memtable}}.
We _could_ try to introduce a separate {{OpOrder}} per table, but this causes
its own issues, since we can have multiple tables in a single operation, each
one with its own different blocking behaviour. I don't want to think about
what bugs we might introduce there.
We could explicitly order operations by their {{OpOrder.Group}} when acquiring
a lock - if pessimistic locking is required, we wait for all earlier operations
to complete before we acquire the lock. I'm not sure what impact this might
have on the system, as this might introduce delays for these operations.
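To make the ordering idea concrete, here is a minimal, self-contained sketch.
It is a toy model, not Cassandra code: {{OrderedLockSketch}}, {{Op}},
{{lockInOrder}} and {{demo}} are hypothetical names, and {{Op}} only stands in
for an {{OpOrder.Group}}. It assumes a single partition lock and that every
operation eventually calls {{finish}}.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: Op stands in for an OpOrder.Group; none of this is Cassandra's API.
class OrderedLockSketch {
    static final class Op {
        final long id;                                      // position in start order
        final CountDownLatch done = new CountDownLatch(1);  // released by finish()
        Op(long id) { this.id = id; }
    }

    private final TreeMap<Long, Op> running = new TreeMap<>();
    private final ReentrantLock partitionLock = new ReentrantLock();
    private long nextId = 0;

    // Register a new operation; ids increase in start order.
    synchronized Op start() {
        Op op = new Op(nextId++);
        running.put(op.id, op);
        return op;
    }

    // Mark an operation complete and wake anything waiting on it.
    void finish(Op op) {
        synchronized (this) { running.remove(op.id); }
        op.done.countDown();
    }

    // Snapshot of still-running operations that started strictly before op.
    private synchronized List<Op> startedBefore(Op op) {
        return new ArrayList<>(running.headMap(op.id).values());
    }

    // Pessimistic path: wait for all earlier operations, then take the lock.
    void lockInOrder(Op op) throws InterruptedException {
        for (Op earlier : startedBefore(op))
            earlier.done.await();
        partitionLock.lock();
    }

    void unlock() { partitionLock.unlock(); }

    // Returns true if the later operation only got the lock after the earlier one finished.
    static boolean demo() {
        OrderedLockSketch sketch = new OrderedLockSketch();
        Op first = sketch.start();
        Op second = sketch.start();
        CountDownLatch acquired = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            try {
                sketch.lockInOrder(second);
                acquired.countDown();
                sketch.unlock();
                sketch.finish(second);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        try {
            // While 'first' is still running, 'second' must not hold the lock.
            boolean early = acquired.await(100, TimeUnit.MILLISECONDS);
            sketch.finish(first);
            boolean late = acquired.await(5, TimeUnit.SECONDS);
            t.join();
            return !early && late;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ordering respected: " + demo());
    }
}
```

In this model an earlier operation never queues for the lock behind a later
one, which is the property we want. The sketch also makes the cost visible:
the later operation stalls for as long as any earlier operation is still
running, whether or not they touch the same partition.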
Alternatively, we really do need the follow-up work I've done recently to
remove the lock entirely. This is a significant amount of work, but has no
real caveats.
> Memtable memory allocations may deadlock
> ----------------------------------------
>
> Key: CASSANDRA-15367
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15367
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Commit Log, Local/Memtable
> Reporter: Benedict Elliott Smith
> Assignee: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> * Under heavy contention, we guard modifications to a partition with a mutex,
> for the lifetime of the memtable.
> * Memtables block for the completion of all {{OpOrder.Group}} started before
> their flush began
> * Memtables permit operations from this cohort to fall-through to the
> following Memtable, in order to guarantee a precise commitLogUpperBound
> * Memtable memory limits may be lifted for operations in the first cohort,
> since they block flush (and hence block future memory allocation)
> With very unfortunate scheduling:
> * A contended partition may rapidly escalate to a mutex
> * The system may reach memory limits that prevent allocations for the new
> Memtable’s cohort (C2)
> * An operation from C2 may hold the mutex when this occurs
> * Operations from a prior Memtable’s cohort (C1), for a contended partition,
> may fall-through to the next Memtable
> * The operations from C1 may execute after the above is encountered by those
> from C2