[
https://issues.apache.org/jira/browse/CASSANDRA-7533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061506#comment-14061506
]
Matt Byrd commented on CASSANDRA-7533:
--------------------------------------
Just to add a bit more context: we had a single instance of Cassandra get
fairly stuck replaying commitlogs.
It was burning through more than 2000% CPU for over four hours with no end in
sight, so we killed it, removed the commit logs, brought it back up, and ran
repair. (This was in QA, thankfully.)
The problem can easily be reproduced by just writing 100,000 CQL row deletes
(range deletes) to the same partition key, stopping Cassandra, and starting it
again.
I admit this is somewhat of an anti-pattern, but it is still quite a dramatic
effect from not very much data.
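For anyone wanting to try the reproduction, here is a minimal sketch that
generates the delete statements. The table name (ks.events), the column names
(pk, ck), and the partition value are hypothetical, and it assumes a table with
a single clustering column so that each row delete is recorded as a range
tombstone. The output could be saved to a file and fed to cqlsh.

```java
import java.util.ArrayList;
import java.util.List;

/** Generates row deletes that all target one partition (hypothetical schema). */
final class ReproStatements {
    // Assumed schema: CREATE TABLE ks.events (pk text, ck int, v text, PRIMARY KEY (pk, ck));
    static List<String> rowDeletes(String table, String partitionKey, int count) {
        List<String> statements = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            // Same pk every time, a different clustering value per statement.
            statements.add("DELETE FROM " + table
                    + " WHERE pk = '" + partitionKey + "' AND ck = " + i + ";");
        }
        return statements;
    }

    public static void main(String[] args) {
        for (String statement : rowDeletes("ks.events", "hot", 100_000)) {
            System.out.println(statement);
        }
    }
}
```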
The problem exercised here is that:
1. We contend in the memtable to do this insert in a CAS loop.
2. The work done in this loop becomes ever more expensive, because the list is
iterated over in RangeTombstoneList.dataSize to compute the size.
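The interaction of the two points can be sketched roughly as follows. This is
not the actual memtable code; CasLoopSketch and Snapshot are hypothetical
stand-ins, with a plain array of sizes playing the role of the tombstone list.
The point it illustrates is that every attempt walks the whole list, and a
failed compareAndSet throws that work away and starts over.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

/** Simplified stand-in for a partition updated via copy-on-write and CAS. */
final class CasLoopSketch {
    /** Immutable snapshot; each update copies it and appends one entry. */
    static final class Snapshot {
        final long[] sizes; // size in bytes of each (pretend) range tombstone

        Snapshot(long[] sizes) { this.sizes = sizes; }

        /** O(n) walk, analogous to iterating the list to compute dataSize. */
        long dataSize() {
            long total = 0;
            for (long size : sizes) total += size;
            return total;
        }

        Snapshot with(long size) {
            long[] next = Arrays.copyOf(sizes, sizes.length + 1);
            next[sizes.length] = size;
            return new Snapshot(next);
        }
    }

    private final AtomicReference<Snapshot> ref =
            new AtomicReference<>(new Snapshot(new long[0]));

    /** Appends one entry and returns the resulting change in data size. */
    long add(long size) {
        while (true) {
            Snapshot current = ref.get();
            long before = current.dataSize();      // O(n), repeated on every retry
            Snapshot updated = current.with(size);
            long after = updated.dataSize();       // O(n) again
            if (ref.compareAndSet(current, updated)) {
                return after - before;
            }
            // Lost the race to another writer: all of the work above is discarded.
        }
    }

    long dataSize() { return ref.get().dataSize(); }
}
```

With many writers on one partition, the retry rate goes up as each attempt gets
slower, which is why a modest amount of data can keep the CPUs busy for hours.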
Point 2 is effectively fixed in 2.1 by all the off-heap allocation work, where
the dataSize calculation effectively becomes more online.
To resolve this problem in 2.0 you could also keep this tally of dataSize
online, or maybe start keeping it online once the list is sufficiently big to
cause a problem.
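A minimal sketch of what keeping the tally online could look like, assuming a
simple append-only list. OnlineSizeList is a hypothetical class for
illustration, not RangeTombstoneList itself, and it ignores merging and
removal, which a real implementation would also have to account for.

```java
import java.util.Arrays;

/** Append-only list that maintains its total data size incrementally. */
final class OnlineSizeList {
    private long[] sizes = new long[8];
    private int count;
    private long dataSize; // running tally, adjusted on every mutation

    void add(long size) {
        if (count == sizes.length) {
            sizes = Arrays.copyOf(sizes, count * 2);
        }
        sizes[count++] = size;
        dataSize += size; // O(1) update instead of an O(n) recount later
    }

    /** O(1): returns the running tally. */
    long dataSize() { return dataSize; }

    /** O(n) recount, kept only to check the tally in tests. */
    long recomputedDataSize() {
        long total = 0;
        for (int i = 0; i < count; i++) total += sizes[i];
        return total;
    }
}
```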
Doing this seemed to help a lot, but far simpler was just reducing the
concurrency of the commitlog replay, which can be achieved by lowering
MAX_OUTSTANDING_REPLAY_COUNT (in our case, setting this to 1 seemed to help).
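As a rough illustration of why the limit controls concurrency, the sketch below
submits work to an executor and blocks whenever the number of outstanding
futures reaches a cap. It is not the CommitLogReplayer code; the class, the
system property name (sketch.max_outstanding_replay_count), and the default of
1024 are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Replays tasks with at most maxOutstanding of them in flight at once. */
final class BoundedReplaySketch {
    // Hypothetical property name and default, for illustration only.
    static final int MAX_OUTSTANDING =
            Integer.getInteger("sketch.max_outstanding_replay_count", 1024);

    static int replay(List<Runnable> mutations, ExecutorService executor, int maxOutstanding)
            throws InterruptedException, ExecutionException {
        List<Future<?>> outstanding = new ArrayList<>();
        int applied = 0;
        for (Runnable mutation : mutations) {
            outstanding.add(executor.submit(mutation));
            if (outstanding.size() >= maxOutstanding) {
                // Drain everything submitted so far before accepting more work.
                for (Future<?> future : outstanding) future.get();
                applied += outstanding.size();
                outstanding.clear();
            }
        }
        for (Future<?> future : outstanding) future.get();
        applied += outstanding.size();
        return applied;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Runnable> mutations = new ArrayList<>();
        for (int i = 0; i < 10; i++) mutations.add(() -> { });
        System.out.println(replay(mutations, executor, MAX_OUTSTANDING));
        executor.shutdown();
    }
}
```

With the cap at 1, each task finishes before the next is submitted, so writers
never race each other in the CAS loop, at the cost of replaying serially.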
Thanks,
Matt
> Let MAX_OUTSTANDING_REPLAY_COUNT be configurable
> ------------------------------------------------
>
> Key: CASSANDRA-7533
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7533
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Jeremiah Jordan
> Assignee: Yuki Morishita
> Priority: Minor
> Fix For: 2.0.10
>
>
> There are some workloads where commit log replay will run into contention
> issues with multiple things updating the same partition. Through some
> testing it was found that lowering MAX_OUTSTANDING_REPLAY_COUNT in
> CommitLogReplayer.java can help with this issue.
> The calculations added in CASSANDRA-6655 are one such place where things get
> bottlenecked.