[
https://issues.apache.org/jira/browse/CASSANDRA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834650#comment-16834650
]
Marcus Olsson commented on CASSANDRA-15119:
-------------------------------------------
{quote}This does indeed seem promising (can't wait for 3.11.5 to try it);
however, in the linked ticket the reporter gets an out-of-memory error. We
don't. We simply see the heap get dangerously close to the Xmx limit in
JProfiler, and then suddenly one or more nodes are down, quickly restarting
afterwards. The only errors given are that streams failed.
{quote}
The ticket was created for Cassandra 3.11.1, and since CASSANDRA-13006
(3.11.2) OOMs are handled by the JVM rather than by Cassandra. Do you think
this could be what you are experiencing, and why there is a difference? If you
do not have heap dumps enabled, could you try enabling them to see if one gets
generated when this happens (I suspect it would, since the nodes are
restarting)? A tool like Eclipse MAT can then produce a _leak suspect report_
from one of the heap dumps, which can be quite helpful for investigating the
cause.
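For reference, heap dumps on OOM are enabled with standard JVM flags, e.g. in
conf/jvm.options (the dump path below is just an example; make sure the disk
has room for a dump roughly the size of the heap):

```
# conf/jvm.options -- requires a node restart; path is an example only
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/cassandra/java_heapdump.hprof
```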
Are you running repairs in parallel, or on one node at a time? Given
CASSANDRA-14096, the repair coordinator node should be under the most stress,
since it stores the Merkle trees for all the replicas. So if you are running
repair on one node but *other* nodes are going down, it could be something
different.
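As a sketch of the one-node-at-a-time approach (the host names and keyspace
here are placeholders; the loop only prints the commands so you can review
them before actually running anything):

```shell
# Placeholder host list -- replace with your three nodes.
hosts="node1 node2 node3"

# 'repair -pr' repairs only each node's primary token ranges, so running it
# on every node in turn covers the whole ring with a single coordinator
# active at a time.
for h in $hosts; do
  echo "nodetool -h $h repair -pr my_keyspace"
done
```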
bq. and can see that it would be busy streaming data to other nodes and
suddenly they go down. We also sometimes see validation compaction happening
but I believe it stops streaming then.
Part of CASSANDRA-14096 is about reducing the amount of time the Merkle trees
stay on-heap. Before the fix, these trees could be retained until the full
keyspace was repaired (including during streaming). So it is possible that the
streaming, plus other activity in the cluster, in combination with these
retained Merkle trees, could tip it over.
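To make the on-heap cost concrete, here is a rough back-of-envelope sketch;
the tree resolution and per-node byte cost are illustrative assumptions, not
measured values from this cluster:

```python
# All numbers below are illustrative assumptions, not measurements.
leaves = 2 ** 20          # assumed Merkle tree leaf count (depth ~20)
bytes_per_node = 50       # assumed heap cost per tree node (hash + overhead)
replicas = 3              # the reporter's cluster has 3 nodes in one DC

tree_bytes = 2 * leaves * bytes_per_node   # internal nodes roughly double it
held_bytes = replicas * tree_bytes         # coordinator holds a tree per replica
print(f"~{held_bytes / 1024 ** 2:.0f} MB held on the coordinator per session")
```

Under these assumptions a single validation round already pins hundreds of
megabytes on the coordinator's heap, which lines up with nodes hovering near
Xmx during repair.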
> Repair fails randomly, causing nodes to restart
> -----------------------------------------------
>
> Key: CASSANDRA-15119
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15119
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair, Consistency/Streaming
> Reporter: Brent
> Priority: Normal
>
> We have a cluster of 3 nodes (same DC) that is ~8GB on disk (per node). One
> keyspace has two tables, combined having about 20m rows with around 20
> columns each. Whenever we try to run a repair (with or without
> cassandra-reaper, on any setting), the repair causes certain nodes to fail
> and restart. Originally these nodes had the default heap space calculation
> on a device with 12GB RAM. We upscaled these to 24GB RAM and 12GB Xmx, which
> seemed to make a difference but still not quite enough. With JProfiler we
> can see that random nodes reach the Xmx limit, regardless of the size of the
> repair, while streaming data.
> I can't understand how such operations can cause servers to literally crash
> rather than just say "no, I can't do it". We've tried a lot of things,
> including setting up a fresh cluster, manually inserting all the data (with
> the correct replication factor), and then running repairs.
> Sometimes they will work (barely), sometimes they will fail. I really don't
> understand.
> We're running Cassandra 3.11.4.
> Could I receive some assistance in troubleshooting this?