Brent created CASSANDRA-15119:
---------------------------------
Summary: Repair fails randomly, causing nodes to restart
Key: CASSANDRA-15119
URL: https://issues.apache.org/jira/browse/CASSANDRA-15119
Project: Cassandra
Issue Type: Bug
Components: Consistency/Repair, Consistency/Streaming
Reporter: Brent
We have a cluster of 3 nodes (same DC), each holding ~8GB on disk. One
keyspace has two tables that together hold about 20 million rows of around
20 columns each. Whenever we run a repair (with or without cassandra-reaper,
on any setting), certain nodes fail and restart. Originally these nodes used
the default heap size calculation on machines with 12GB of RAM.
We upgraded these to 24GB of RAM with a 12GB max heap (-Xmx), which seemed to
help but was still not quite enough. With JProfiler we can see that random
nodes reach the -Xmx limit while streaming data, regardless of the size of
the repair.
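For context on the heap sizes mentioned above, here is a minimal sketch of the
default max-heap calculation. It assumes the formula used by the stock
cassandra-env.sh in 3.x: max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8192MB)). The
function name is hypothetical, and real figures depend on how much total
memory the OS reports.

```python
def default_max_heap_mb(system_memory_mb: int) -> int:
    """Approximate the default max heap (MB) for a given amount of RAM (MB).

    Assumption: mirrors the calculation in the stock cassandra-env.sh,
    i.e. max(min(RAM / 2, 1024), min(RAM / 4, 8192)).
    """
    half = min(system_memory_mb // 2, 1024)      # half of RAM, capped at 1GB
    quarter = min(system_memory_mb // 4, 8192)   # quarter of RAM, capped at 8GB
    return max(half, quarter)


if __name__ == "__main__":
    for ram_gb in (12, 24):
        heap_mb = default_max_heap_mb(ram_gb * 1024)
        print(f"{ram_gb}GB RAM -> default max heap of about {heap_mb}MB")
```

Under that assumption, the original 12GB machines would have run with a heap
of roughly 3GB, and the 24GB machines would default to roughly 6GB, so the
explicit 12GB -Xmx is about double the default for that machine size.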
I can't understand how such operations can cause servers to literally crash
rather than just say "no I can't do it". We've tried a lot of things, including
setting up a fresh cluster, manually inserting all the data (with the
correct replication factor), and then running repairs.
Sometimes the repairs work (barely), sometimes they fail. I really don't
understand why.
We're running Cassandra 3.11.4.