[
https://issues.apache.org/jira/browse/CASSANDRA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834517#comment-16834517
]
Marcus Olsson commented on CASSANDRA-15119:
-------------------------------------------
I think CASSANDRA-14096 could be interesting to look at here; it should be
fixed in *3.11.5*. Your ~20m rows are a further indication of this, since that
many rows could create large Merkle trees.
You could try to run {{jmap -histo}} for the Cassandra process several times
during one of these repairs as that could reveal what type of objects are
building up. If you see a lot of MerkleTree-related objects
(MerkleTree$Inner/Leaf etc.) at the top of the histogram using large amounts of
memory, this could be related to CASSANDRA-14096.
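To make the repeated {{jmap -histo}} runs easier to compare, something like the following could help. It is only a rough sketch: the helper names ({{merkle_tree_usage}}, {{sample}}) are made up for illustration, and it assumes the usual histogram row layout of rank, instance count, byte count and class name, with {{jmap}} on the PATH and run as the same user as the Cassandra process.

```python
import re
import subprocess

# Assumed row layout of `jmap -histo` output:
#   <rank>:  <#instances>  <#bytes>  <class name>
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def merkle_tree_usage(histo_text):
    """Return {class_name: (instances, bytes)} for MerkleTree-related classes."""
    usage = {}
    for line in histo_text.splitlines():
        match = ROW.match(line)
        # Header and separator lines do not match ROW and are skipped.
        if match and "MerkleTree" in match.group(3):
            usage[match.group(3)] = (int(match.group(1)), int(match.group(2)))
    return usage


def sample(pid):
    """Run `jmap -histo <pid>` once and keep only the MerkleTree rows."""
    result = subprocess.run(
        ["jmap", "-histo", str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return merkle_tree_usage(result.stdout)
```

Calling {{sample(pid)}} every minute or so while a repair is running and watching whether the byte counts keep growing should give a first indication of whether the Merkle trees are what fills the heap.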
Also, are you using virtual nodes or single token per node?
*Note:*
I believe repair could start the next round of validation compactions
(MerkleTree creation) in parallel with streaming the data files, which could
explain your observation of reaching Xmx during streaming.
> Repair fails randomly, causing nodes to restart
> -----------------------------------------------
>
> Key: CASSANDRA-15119
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15119
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair, Consistency/Streaming
> Reporter: Brent
> Priority: Normal
>
> We have a cluster of 3 nodes (same dc) that is ~8GB on disk (per node). One
> keyspace has two tables, combined having about 20m rows with around 20 columns
> each. Whenever we try to run a repair (with or without cassandra-reaper, on
> any setting) the repair causes certain nodes to fail and restart. Originally
> these nodes had the default heap space calculation on a device with 12GB ram.
> We upscaled these to 24GB ram and 12GB XMX which seemed to make a difference
> but still not quite enough. With JProfiler we can see that random nodes reach
> the xmx limit, regardless of the size of the repair, while streaming data.
> I can't understand that such operations can cause servers to literally crash
> rather than just say "no I can't do it". We've tried a lot of things
> including setting up a fresh cluster and manually inserting all the data
> (with the correct replication factor) and then run repairs.
> Sometimes they will work (barely) sometimes they will fail. I really don't
> understand.
> We're running cassandra 3.11.4.
> Could I receive some assistance in troubleshooting this?