[
https://issues.apache.org/jira/browse/CASSANDRA-15367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017436#comment-17017436
]
Benedict Elliott Smith commented on CASSANDRA-15367:
----------------------------------------------------
For comparison, [this
patch|https://github.com/belliottsmith/cassandra/tree/15367-a2] addresses this
ticket by ensuring allocations only happen whilst the lock is not held. It
aims to reduce the necessity of locking, not just for this use case, without
removing it altogether.
1. So that the fast path is unaffected, we perform our first attempt to insert
   as normal. Unlike before, we disable {{abortEarly}} for this first attempt,
   so that we always construct a complete new tree.
2. If we fail, we walk this new tree, looking for any remnants of the insert.
   These remnants are collected into a new insert containing only the parts
   that were retained after resolving. This new insert contains only
   Memtable-allocated data, so we do not need to copy anything on the next
   attempt.
3. Future attempts to insert operate on this minimal copied version of the
   data, thus preventing the worst-case scenario the lock was introduced for,
   namely Memtable exhaustion. However, to minimise any performance
   regression, we retain the lock and continue to perform the same waste
   tracking as before.
4. If locking has been enabled for the partition, step 1 is skipped, and we
   immediately copy the entire insert into the Memtable before obtaining the
   lock.
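To make the steps above concrete, here is a minimal toy sketch in Java. It is not the code from the patch: {{Cell}}, {{ToyPartition}}, {{copyIn}}, {{remnants}} and the {{beforeFirstCas}} test hook are all hypothetical names, the "tree" is a plain {{TreeMap}} snapshot swapped by compare-and-set, and "Memtable allocation" is just a byte counter. It also assumes every retry takes the lock, whereas the real code decides that through waste tracking. It only illustrates the property that all copying happens before the lock is taken.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical toy model; none of these names come from Cassandra.
final class Cell {
    final String value; final long timestamp; final boolean inMemtable;
    Cell(String value, long timestamp, boolean inMemtable) {
        this.value = value; this.timestamp = timestamp; this.inMemtable = inMemtable;
    }
}

final class ToyPartition {
    final AtomicReference<TreeMap<String, Cell>> tree = new AtomicReference<>(new TreeMap<>());
    final ReentrantLock lock = new ReentrantLock();
    final AtomicLong allocated = new AtomicLong(); // stand-in for Memtable memory accounting
    volatile boolean lockingEnabled = false;       // assumed to be set by contention tracking

    // "Copy" a cell into the Memtable; cells that were already copied are reused for free.
    Cell copyIn(Cell c) {
        if (c.inMemtable) return c;
        allocated.addAndGet(c.value.length());
        return new Cell(c.value, c.timestamp, true);
    }

    // Always build a complete new tree (no early abort); the newest timestamp wins.
    TreeMap<String, Cell> merge(TreeMap<String, Cell> current, Map<String, Cell> insert) {
        TreeMap<String, Cell> next = new TreeMap<>(current);
        for (Map.Entry<String, Cell> e : insert.entrySet()) {
            Cell existing = next.get(e.getKey());
            if (existing == null || e.getValue().timestamp > existing.timestamp)
                next.put(e.getKey(), copyIn(e.getValue()));
        }
        return next;
    }

    // Walk the tree we failed to install and keep only the cells our insert contributed.
    Map<String, Cell> remnants(TreeMap<String, Cell> current, TreeMap<String, Cell> built) {
        Map<String, Cell> kept = new TreeMap<>();
        for (Map.Entry<String, Cell> e : built.entrySet())
            if (current.get(e.getKey()) != e.getValue()) kept.put(e.getKey(), e.getValue());
        return kept;
    }

    void insert(Map<String, Cell> insert, Runnable beforeFirstCas) {
        if (!lockingEnabled) {
            // Step 1: optimistic attempt without the lock.
            TreeMap<String, Cell> current = tree.get();
            TreeMap<String, Cell> built = merge(current, insert);
            beforeFirstCas.run(); // test hook: lets a caller simulate a racing writer
            if (tree.compareAndSet(current, built)) return;
            // Step 2: retain only what survived resolution; it is all Memtable-allocated.
            insert = remnants(current, built);
        } else {
            // Step 4: copy the whole insert into the Memtable before obtaining the lock.
            Map<String, Cell> copied = new TreeMap<>();
            for (Map.Entry<String, Cell> e : insert.entrySet())
                copied.put(e.getKey(), copyIn(e.getValue()));
            insert = copied;
        }
        // Step 3: retries run under the lock and no longer allocate.
        lock.lock();
        try {
            while (true) {
                TreeMap<String, Cell> current = tree.get();
                if (tree.compareAndSet(current, merge(current, insert))) return;
            }
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch an insert that loses its first compare-and-set pays for its copy exactly once: the retry under the lock reuses the cells that were already copied, so the {{allocated}} counter does not grow while the lock is held.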
The performance impact of this patch is still being comprehensively validated,
and the results will be posted in a few days. It is reasonable to expect that
there will be some slight performance penalty in some cases, and some
improvements in others.
> Memtable memory allocations may deadlock
> ----------------------------------------
>
> Key: CASSANDRA-15367
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15367
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Commit Log, Local/Memtable
> Reporter: Benedict Elliott Smith
> Assignee: Benedict Elliott Smith
> Priority: Normal
> Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> * Under heavy contention, we guard modifications to a partition with a mutex,
> for the lifetime of the memtable.
> * Memtables block for the completion of all {{OpOrder.Group}} started before
> their flush began
> * Memtables permit operations from this cohort to fall-through to the
> following Memtable, in order to guarantee a precise commitLogUpperBound
> * Memtable memory limits may be lifted for operations in the first cohort,
> since they block flush (and hence block future memory allocation)
> With very unfortunate scheduling:
> * A contended partition may rapidly escalate to a mutex
> * The system may reach memory limits that prevent allocations for the new
> Memtable’s cohort (C2)
> * An operation from C2 may hold the mutex when this occurs
> * Operations from a prior Memtable’s cohort (C1), for a contended partition,
> may fall-through to the next Memtable
> * The operations from C1 may execute after the above is encountered by those
> from C2