[ 
https://issues.apache.org/jira/browse/CASSANDRA-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022906#comment-17022906
 ] 

Benedict Elliott Smith commented on CASSANDRA-15367:
----------------------------------------------------

So, I decided to start writing a version of your approach with slightly more 
explicit control flow.  However, I realised that this bug is not fixed by this 
approach, or my original approach.

The issue is that we have all been assuming there is only one table in the 
system.  In fact, the flushing {{Memtable}} that's waiting for the operation to 
complete may be in an altogether different table.  It might be that the 
operation holding the lock and the operation that needs to obtain the lock are 
both members of the same logical cohort for this {{Memtable}}. 

We _could_ try to introduce a separate {{OpOrder}} per table, but this causes 
its own issues, since we can have multiple tables in a single operation, each 
one with its own different blocking behaviour.  I don't want to think about 
what bugs we might introduce there.

We could explicitly order operations by their {{OpOrder.Group}} when acquiring 
a lock - if pessimistic locking is required, we wait for all earlier operations 
to complete before we acquire the lock.  I'm not sure what impact this might 
have on the system, as this might introduce delays for these operations.
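
As a rough illustration of that ordering rule (a toy model, not Cassandra's 
{{OpOrder}} API: the {{Tracker}} class, its ids and 
{{updateWithPessimisticLock}} are hypothetical names invented for this sketch), 
an operation that needs the pessimistic lock might first wait for every 
earlier-started operation to finish:

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

class OrderedLockSketch {
    /** Hypothetical stand-in for an operation-ordering primitive; NOT OpOrder itself. */
    static final class Tracker {
        private long nextId = 0;
        private final TreeSet<Long> running = new TreeSet<>();

        /** Registers a new operation and returns its (monotonically increasing) id. */
        synchronized long start() {
            long id = nextId++;
            running.add(id);
            return id;
        }

        /** Marks an operation complete and wakes any waiters. */
        synchronized void finish(long id) {
            running.remove(id);
            notifyAll();
        }

        /** True if any operation started before {@code id} is still running. */
        synchronized boolean hasEarlierRunning(long id) {
            return !running.isEmpty() && running.first() < id;
        }

        /** Blocks until every operation started before {@code id} has finished. */
        synchronized void awaitEarlier(long id) throws InterruptedException {
            while (hasEarlierRunning(id))
                wait();
        }
    }

    /** Assumed shape of a contended update: wait for earlier operations, then lock. */
    static void updateWithPessimisticLock(Tracker tracker, long id,
                                          ReentrantLock partitionLock, Runnable update)
            throws InterruptedException {
        tracker.awaitEarlier(id);   // no earlier operation can now be stuck behind our lock
        partitionLock.lock();
        try {
            update.run();
        } finally {
            partitionLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Tracker tracker = new Tracker();
        ReentrantLock partitionLock = new ReentrantLock();
        long earlier = tracker.start();
        long later = tracker.start();

        Thread laterOp = new Thread(() -> {
            try {
                updateWithPessimisticLock(tracker, later, partitionLock,
                        () -> System.out.println("later operation updated the partition"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                tracker.finish(later);
            }
        });
        laterOp.start();

        System.out.println("earlier operation still running");
        tracker.finish(earlier);    // only now may the later operation take the lock
        laterOp.join();
    }
}
```

Here each operation gets its own id, whereas the actual proposal would order by 
{{OpOrder.Group}}; the wait before {{lock()}} is also where the delays 
mentioned above would show up.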

Alternatively, we really do need the follow-up work I've done recently to 
remove the lock entirely.  This is a significant amount of work, but has no 
real caveats.

> Memtable memory allocations may deadlock
> ----------------------------------------
>
>                 Key: CASSANDRA-15367
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15367
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Commit Log, Local/Memtable
>            Reporter: Benedict Elliott Smith
>            Assignee: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> * Under heavy contention, we guard modifications to a partition with a mutex, 
> for the lifetime of the memtable.
> * Memtables block for the completion of all {{OpOrder.Group}} started before 
> their flush began
> * Memtables permit operations from this cohort to fall-through to the 
> following Memtable, in order to guarantee a precise commitLogUpperBound
> * Memtable memory limits may be lifted for operations in the first cohort, 
> since they block flush (and hence block future memory allocation)
>
> With very unfortunate scheduling:
> * A contended partition may rapidly escalate to a mutex
> * The system may reach memory limits that prevent allocations for the new 
> Memtable’s cohort (C2) 
> * An operation from C2 may hold the mutex when this occurs
> * Operations from a prior Memtable’s cohort (C1), for a contended partition, 
> may fall-through to the next Memtable
> * The operations from C1 may execute after the above is encountered by those 
> from C2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

