[jira] [Updated] (CASSANDRA-15119) Repair fails randomly, causing nodes to restart

Brent (JIRA) Mon, 06 May 2019 07:50:06 -0700


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-15119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Brent updated CASSANDRA-15119:
------------------------------
    Description: 
We have a cluster of 3 nodes (same DC) holding ~8GB on disk per node. One 
keyspace has two tables, together about 20 million rows with around 20 columns 
each. Whenever we try to run a repair (with or without cassandra-reaper, on any 
setting), the repair causes certain nodes to fail and restart. Originally these 
nodes used the default heap size calculation on machines with 12GB RAM.
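
For context, here is a rough sketch of what the default heap sizing works out to. It assumes the formula documented in the stock cassandra-env.sh for 3.11, max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8192MB)); the function name is hypothetical, and real nodes report slightly less memory than their nominal size, so actual values may come out a little lower:

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Hypothetical helper: max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8192MB))."""
    # Half of system memory, capped at 1024MB.
    half_capped = min(system_memory_mb // 2, 1024)
    # A quarter of system memory, capped at 8192MB.
    quarter_capped = min(system_memory_mb // 4, 8192)
    # The larger of the two becomes the default max heap.
    return max(half_capped, quarter_capped)


if __name__ == "__main__":
    # Nominal RAM sizes from this report; assumed to be fully visible to the OS.
    for ram_gb in (12, 24):
        heap_mb = default_max_heap_mb(ram_gb * 1024)
        print(f"{ram_gb}GB RAM -> default max heap of about {heap_mb}MB")
```

Under that assumption a 12GB machine would default to roughly a 3GB heap and a 24GB machine to roughly 6GB, so an explicit 12GB max heap is well above either default.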

We upscaled these to 24GB RAM and a 12GB max heap (-Xmx), which seemed to make 
a difference but still not quite enough. With JProfiler we can see that random 
nodes reach the -Xmx limit while streaming data, regardless of the size of the 
repair.

I can't understand how such operations can cause servers to literally crash 
rather than just say "no I can't do it". We've tried a lot of things, including 
setting up a fresh cluster, manually inserting all the data (with the 
correct replication factor), and then running repairs.

Sometimes the repairs work (barely), sometimes they fail. I really don't 
understand why.

We're running Cassandra 3.11.4.

Could I receive some assistance in troubleshooting this?

  was:
We have a cluster of 3 nodes (same DC) holding ~8GB on disk per node. One 
keyspace has two tables, together about 20 million rows with around 20 columns 
each. Whenever we try to run a repair (with or without cassandra-reaper, on any 
setting), the repair causes certain nodes to fail and restart. Originally these 
nodes used the default heap size calculation on machines with 12GB RAM.

We upscaled these to 24GB RAM and a 12GB max heap (-Xmx), which seemed to make 
a difference but still not quite enough. With JProfiler we can see that random 
nodes reach the -Xmx limit while streaming data, regardless of the size of the 
repair.

I can't understand how such operations can cause servers to literally crash 
rather than just say "no I can't do it". We've tried a lot of things, including 
setting up a fresh cluster, manually inserting all the data (with the 
correct replication factor), and then running repairs.

Sometimes the repairs work (barely), sometimes they fail. I really don't 
understand why.

We're running Cassandra 3.11.4.


> Repair fails randomly, causing nodes to restart
> -----------------------------------------------
>
>                 Key: CASSANDRA-15119
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15119
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair, Consistency/Streaming
>            Reporter: Brent
>            Priority: Normal
>
> We have a cluster of 3 nodes (same DC) holding ~8GB on disk per node. One 
> keyspace has two tables, together about 20 million rows with around 20 
> columns each. Whenever we try to run a repair (with or without 
> cassandra-reaper, on any setting), the repair causes certain nodes to fail 
> and restart. Originally these nodes used the default heap size calculation 
> on machines with 12GB RAM.
> We upscaled these to 24GB RAM and a 12GB max heap (-Xmx), which seemed to 
> make a difference but still not quite enough. With JProfiler we can see that 
> random nodes reach the -Xmx limit while streaming data, regardless of the 
> size of the repair.
> I can't understand how such operations can cause servers to literally crash 
> rather than just say "no I can't do it". We've tried a lot of things, 
> including setting up a fresh cluster, manually inserting all the data (with 
> the correct replication factor), and then running repairs.
> Sometimes the repairs work (barely), sometimes they fail. I really don't 
> understand why.
> We're running Cassandra 3.11.4.
> Could I receive some assistance in troubleshooting this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

