[
https://issues.apache.org/jira/browse/CASSANDRA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836903#comment-16836903
]
Brent commented on CASSANDRA-15119:
-----------------------------------
[~molsson] I will provide you with a heap dump (and a video of the repair
process) soon. During my previous attempts to get the heap dump, none of the
nodes would crash (with JProfiler attached) despite having only 0.25GB of the
12GB heap free and GC being unable to reclaim much. I tried taking a manual
heap dump, but my test VM did not have enough disk space for it.
What's strange is that nodetool repair reported the repair as failed, and only
after that did the node I was monitoring start allocating large amounts of
heap memory. Before that, heap usage was steady and low despite a lot of
activity still remaining. Apart from the repair, there is no other activity on
this test cluster.
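For what it's worth, next time I'll check free disk space before dumping, and let the JVM write the dump itself on OOM so a crash gets captured automatically. A rough sketch (the paths and the pid placeholder are just examples):
{noformat}
# a dump of a 12GB heap can need roughly 12GB of free disk -- check first
df -h /var/tmp

# dump only live objects to keep the file smaller
jmap -dump:live,format=b,file=/var/tmp/cassandra-heap.hprof <cassandra-pid>

# or let the JVM write the dump itself when it actually runs out of heap
# (JVM flags, e.g. via jvm.options / cassandra-env.sh):
#   -XX:+HeapDumpOnOutOfMemoryError
#   -XX:HeapDumpPath=/var/tmp
{noformat}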
I think that while memory usage was that high,
{noformat}
nodetool compactionstats{noformat}
showed 3 or 4 streaming entries permanently stuck at 100%. That was roughly 15
minutes after the repair had already failed.
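If it happens again, I'll also watch whether those streams are actually making progress, e.g. (just a sketch):
{noformat}
# repeat a few times; if the byte counters never change, the streams are stuck
nodetool netstats
nodetool compactionstats
{noformat}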
> Repair fails randomly, causing nodes to restart
> -----------------------------------------------
>
> Key: CASSANDRA-15119
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15119
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair, Consistency/Streaming
> Reporter: Brent
> Priority: Normal
>
> We have a cluster of 3 nodes (same dc) that is ~8GB on disk (per node). One
> keyspace has two tables, combined having about 20m rows with around 20 columns
> each. Whenever we try to run a repair (with or without cassandra-reaper, on
> any setting) the repair causes certain nodes to fail and restart. Originally
> these nodes had the default heap space calculation on a device with 12GB ram.
> We upscaled these to 24GB RAM and a 12GB Xmx, which seemed to make a difference
> but still not quite enough. With JProfiler we can see that random nodes reach
> the xmx limit, regardless of the size of the repair, while streaming data.
> I can't understand how such operations can cause servers to literally crash
> rather than simply say "no, I can't do it". We've tried a lot of things,
> including setting up a fresh cluster, manually inserting all the data (with
> the correct replication factor), and then running repairs.
> Sometimes they will work (barely); sometimes they will fail. I really don't
> understand it.
> We're running Cassandra 3.11.4.
> Could I receive some assistance in troubleshooting this?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]