[
https://issues.apache.org/jira/browse/CASSANDRA-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stanislav Vishnevskiy updated CASSANDRA-13687:
----------------------------------------------
Attachment: 3.0.9heap.png
3.0.14heap.png
3.0.14cpu.png
We just had this happen again. I am attaching screenshots of a similar time
range, again from before and after the upgrade.
As you can see in the [^3.0.14heap.png] image, at 1PM the heap spikes to 6GB
and we have to take down the node because it makes the cluster start failing.
We then change MAX_HEAP_SIZE to 12GB, bring the node back up, and repair (see
the sketch below). This time the heap spikes to 8GB and stays there through the
whole repair, then drops to 600MB without a huge CMS pause, almost as if it
were one big object. The node calling repair (1-1) is the only one with the
heap growth. If you look at [^3.0.9heap.png], this did not happen during repair
on 3.0.9 and all nodes looked similar.
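For completeness, this is roughly the heap change we made; a minimal sketch of
the edit in conf/cassandra-env.sh (the HEAP_NEWSIZE value and the service
command are illustrative placeholders, not taken from our actual setup):
{code}
# conf/cassandra-env.sh -- rough sketch of the heap bump
# (with the stock CMS settings in 3.0, the env script expects MAX_HEAP_SIZE
#  and HEAP_NEWSIZE to be set or unset together)
MAX_HEAP_SIZE="12G"
HEAP_NEWSIZE="3G"    # illustrative only; size to the machine's core count

# then bring the node back up and kick off the repair again
sudo service cassandra start    # or however the node is normally managed
{code}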
Another interesting thing is CPU usage, as seen in [^3.0.14cpu.png]. The node
performing the nodetool repair (in blue) is using far more CPU than the other
nodes in the cluster. When we compared this a week ago on 3.0.9, that was not
the case.
This feels like a bug in repair?
> Abnormal heap growth and long GC during repair.
> -----------------------------------------------
>
> Key: CASSANDRA-13687
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13687
> Project: Cassandra
> Issue Type: Bug
> Reporter: Stanislav Vishnevskiy
> Attachments: 3.0.14cpu.png, 3.0.14heap.png, 3.0.14.png,
> 3.0.9heap.png, 3.0.9.png
>
>
> We recently upgraded from 3.0.9 to 3.0.14 to get the fix from CASSANDRA-13004.
> Sadly, 3 out of the last 7 nights we have had to wake up due to Cassandra dying
> on us. We currently don't have any data to help reproduce this, but since there
> aren't many commits between the two versions the cause might be obvious.
> Basically we trigger a parallel incremental repair from a single node every
> night at 1AM (see the sketch below). That node will sometimes start allocating
> heavily, keeping the heap maxed out and triggering GC. Some of these GC pauses
> can last up to 2 minutes, which effectively destroys the whole cluster due to
> timeouts to this node.
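> A minimal sketch of the nightly trigger (the keyspace name and log path are
> placeholders, not our real setup; on 3.0.x a plain "nodetool repair" runs an
> incremental, parallel repair by default):
> {code}
> # crontab on the coordinating node (1-1): run the repair every night at 1AM
> 0 1 * * * /usr/bin/nodetool repair my_keyspace >> /var/log/cassandra/repair.log 2>&1
> {code}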
> The only workaround we currently have is to drain the node and restart the
> repair; the second attempt has worked fine every time. Roughly, it looks like
> the sketch below.
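> (The service command is just whatever the init system calls Cassandra and the
> keyspace is a placeholder; this only illustrates the sequence.)
> {code}
> nodetool drain                   # flush memtables and stop the node taking traffic
> sudo service cassandra restart   # bring the JVM back up with a fresh heap
> nodetool repair my_keyspace      # re-run the repair; the second attempt has been fine
> {code}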
> I attached heap charts from 3.0.9 and 3.0.14 during repair.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]