[jira] [Commented] (CASSANDRA-15119) Repair fails randomly, causing nodes to restart

Brent (JIRA) Tue, 07 May 2019 03:10:53 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834608#comment-16834608
 ]


Brent commented on CASSANDRA-15119:
-----------------------------------

Hi [~molsson], Thanks for the reply!

This does indeed look promising (can't wait for 3.11.5 to try it). However, in 
the linked ticket the reporter gets an explicit out-of-memory error. We 
don't. We simply see in JProfiler that heap usage gets dangerously close to the 
Xmx limit, and then suddenly one or more nodes go down and restart shortly 
afterwards. The only errors reported are that streams failed.

During the repair, however, we monitor with 
{noformat}
nodetool compactionstats
{noformat}
and can see that a node is busy streaming data to other nodes when they 
suddenly go down. We also sometimes see validation compactions running, but I 
believe streaming stops at that point.
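
For anyone who wants to reproduce the monitoring, here is a minimal sketch of how the polling could be scripted. The function name `snapshot` and the `NODETOOL` override variable are hypothetical and only for illustration; `nodetool netstats` is added as an assumption, since it is the subcommand that reports streaming sessions, and it is not necessarily what we ran.

```shell
# Hypothetical helper: print one timestamped snapshot of compaction and
# streaming activity on the local node.
snapshot() {
  # NODETOOL is an illustrative override; defaults to nodetool on PATH.
  nt="${NODETOOL:-nodetool}"
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
  "$nt" compactionstats   # pending and active compactions, incl. validation
  "$nt" netstats          # streaming sessions to and from other nodes
}

# Example: poll every 10 seconds until interrupted.
# while true; do snapshot; sleep 10; done
```

Redirecting the loop output to a file on each node would leave a record of what was running right before a node goes down.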

 

> Repair fails randomly, causing nodes to restart
> -----------------------------------------------
>
>                 Key: CASSANDRA-15119
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15119
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair, Consistency/Streaming
>            Reporter: Brent
>            Priority: Normal
>
> We have a cluster of 3 nodes (same DC) with ~8GB on disk per node. One 
> keyspace has two tables, together holding about 20 million rows with around 20 
> columns each. Whenever we try to run a repair (with or without 
> cassandra-reaper, on any setting), the repair causes certain nodes to fail and 
> restart. Originally these nodes used the default heap size calculation on 
> machines with 12GB RAM.
> We upscaled them to 24GB RAM and a 12GB Xmx, which seemed to make a difference 
> but still not quite enough. With JProfiler we can see that random nodes reach 
> the Xmx limit while streaming data, regardless of the size of the repair.
> I can't understand how such operations can cause servers to literally crash 
> rather than just say "no I can't do it". We've tried a lot of things, 
> including setting up a fresh cluster, manually inserting all the data 
> (with the correct replication factor) and then running repairs.
> Sometimes they work (barely), sometimes they fail. I really don't 
> understand.
> We're running Cassandra 3.11.4.
> Could I receive some assistance in troubleshooting this?
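
For context on the "default heap space calculation" mentioned in the description, here is a rough sketch of the rule as I understand it from the comments in the stock 3.x `cassandra-env.sh`: max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8GB)). That rule is an assumption to verify against the installed file, and `default_heap_mb` is a hypothetical helper name, not something Cassandra ships.

```shell
# Sketch of the assumed default max-heap rule:
#   max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB))
# default_heap_mb is a hypothetical name used only for illustration.
default_heap_mb() {
  ram_mb=$1
  half=$((ram_mb / 2))
  quarter=$((ram_mb / 4))
  # Cap each candidate at its limit.
  if [ "$half" -gt 1024 ]; then half=1024; fi
  if [ "$quarter" -gt 8192 ]; then quarter=8192; fi
  # The default heap is the larger of the two capped values.
  if [ "$half" -gt "$quarter" ]; then
    echo "$half"
  else
    echo "$quarter"
  fi
}

default_heap_mb 12288   # machine with 12GB RAM
default_heap_mb 24576   # machine with 24GB RAM
```

Under that rule a 12GB machine would default to a 3072MB heap and a 24GB machine to 6144MB, so the explicit 12GB Xmx is well above either default.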



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

