[
https://issues.apache.org/jira/browse/CASSANDRA-8639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033829#comment-15033829
]
T Jake Luciani commented on CASSANDRA-8639:
-------------------------------------------
This isn't related to the ticket, but maybe we should fix it as well: I don't
see anyplace where we wait for the replay futures to complete before we finish
recover(). Both the 2.1 code and your patch will exit early, before the futures
have all finished. It looks like the old version only waited when there were
more than the max outstanding mutations, which is also wrong and racy. We
should always wait for the queue to drain completely before the method exits.
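
To illustrate the drain-before-exit point, here is a minimal sketch. It is not
the actual CommitLogReplayer code: the class name, MAX_OUTSTANDING, and
replayAll() are hypothetical stand-ins. It bounds the number of in-flight
tasks while submitting and then waits on every remaining future
unconditionally before returning.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ReplayDrainSketch {
    // Hypothetical cap on in-flight tasks; not Cassandra's actual constant.
    static final int MAX_OUTSTANDING = 4;

    /** Submits count tasks and returns only after every one has completed. */
    static int replayAll(int count) throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        Queue<Future<?>> futures = new ArrayDeque<>();
        AtomicInteger applied = new AtomicInteger();
        try {
            for (int i = 0; i < count; i++) {
                // Stand-in for applying one replayed mutation.
                futures.add(executor.submit(() -> { applied.incrementAndGet(); }));
                // Back-pressure: block on the oldest future once over the cap.
                while (futures.size() > MAX_OUTSTANDING) {
                    futures.remove().get();
                }
            }
            // Unconditional drain: wait for everything still queued,
            // regardless of how many futures are outstanding.
            while (!futures.isEmpty()) {
                futures.remove().get();
            }
        } finally {
            executor.shutdown();
        }
        return applied.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("applied: " + replayAll(10));
    }
}
```

Because the final loop runs even when fewer than MAX_OUTSTANDING futures are
pending, the count returned always equals the number of tasks submitted.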
I'm not sure why futures was changed to a deque. It looks like you only use
queue methods, but maybe I missed something?
The only other thing I noticed was in the test: you should validate that the
test data is not found after you clear the CF, in case the replay isn't working.
You also have a 2.1 utest failure related to CL; I'm not sure if it's related
to the patch:
org.apache.cassandra.cql3.DropKeyspaceCommitLogRecycleTest.testRecycle
And one dtest failure in 2.1: commitlog_test.TestCommitLog.test_bad_crc
I didn't check the
> Can OOM on CL replay with dense mutations
> -----------------------------------------
>
> Key: CASSANDRA-8639
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8639
> Project: Cassandra
> Issue Type: Bug
> Components: Local Write-Read Paths
> Reporter: T Jake Luciani
> Assignee: Ariel Weisberg
> Priority: Minor
> Fix For: 2.1.x
>
>
> If you write dense mutations with many clustering keys, the replay of the CL
> can quickly overwhelm a node on startup. This looks to be caused by the fact
> that we only ensure there are 1000 mutations in flight at a time, but those
> mutations could have thousands of cells in them.
> A better approach would be to limit the CL replay to the amount of memory in
> flight using cell.unsharedHeapSize()
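
A rough sketch of the memory-based limit proposed in the description above.
This is an illustration under stated assumptions, not the patch itself: the
class, the Pending holder, and replay() are hypothetical, and the per-mutation
sizes are passed in by the caller (in the real code they would presumably be
derived from something like the sum of cell.unsharedHeapSize() over the
mutation's cells).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ByteBoundedReplaySketch {
    /** Pairs a pending task with the estimated heap size of its mutation. */
    static final class Pending {
        final Future<?> future;
        final long bytes;
        Pending(Future<?> future, long bytes) {
            this.future = future;
            this.bytes = bytes;
        }
    }

    /**
     * Submits one task per mutation size, throttling on bytes rather than
     * on task count. Returns the peak number of bytes accounted as in flight.
     */
    static long replay(long[] mutationSizes, long maxBytesInFlight) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        Queue<Pending> pending = new ArrayDeque<>();
        long inFlight = 0;
        long peak = 0;
        try {
            for (long size : mutationSizes) {
                // Wait on the oldest tasks until the new mutation fits.
                // A single oversized mutation is still admitted once the
                // queue is empty, so replay cannot stall forever.
                while (!pending.isEmpty() && inFlight + size > maxBytesInFlight) {
                    Pending oldest = pending.remove();
                    oldest.future.get();
                    inFlight -= oldest.bytes;
                }
                // Empty task as a stand-in for applying the mutation.
                pending.add(new Pending(executor.submit(() -> { }), size));
                inFlight += size;
                peak = Math.max(peak, inFlight);
            }
            // Drain everything before returning.
            while (!pending.isEmpty()) {
                pending.remove().future.get();
            }
        } finally {
            executor.shutdown();
        }
        return peak;
    }

    public static void main(String[] args) throws Exception {
        long[] sizes = {400, 400, 400, 1000, 100};
        System.out.println("peak bytes in flight: " + replay(sizes, 1000));
    }
}
```

The accounting here is conservative: bytes are released only when the
submitting thread has waited on a task, so the tracked figure never
undercounts what is actually outstanding.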