[ https://issues.apache.org/jira/browse/CASSANDRA-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083554#comment-16083554 ]

Stanislav Vishnevskiy commented on CASSANDRA-13687:
---------------------------------------------------

I am assuming you were referring to "Compacted partition maximum bytes"; the 
largest one is 20MB. The good news is that that one is probably going to be 
deleted later this week, because we figured out a better way to deal with 
outlier users. That said, 20MB is well below the recommended 100MB limit. I 
can't get anything off netstats currently; I'll probably have to wait until it 
happens again.
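
For reference, those numbers come from nodetool; a rough sketch of the checks I was describing (the keyspace/table names are placeholders):

    # per-table stats, including "Compacted partition maximum bytes"
    nodetool tablestats my_keyspace.my_table

    # streaming activity during repair
    nodetool netstats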

The question, though, is why does this only happen on the node that is running 
the repair command? If this were a streaming issue, wouldn't other nodes also 
be affected? Is there a specific bugfix that caused this behavior change? It 
seems really weird for a hotfix version bump to change behavior this way, and 
it is not documented anywhere.

We run incremental repairs every 24 hours, so the node definitely was not behind on repairs.
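
For completeness, the nightly job is nothing fancy; it is roughly this from cron on a single node (the keyspace name is a placeholder; incremental and parallel are the 3.0 defaults):

    # 1AM every night
    0 1 * * * nodetool repair my_keyspace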

> Abnormal heap growth and long GC during repair.
> -----------------------------------------------
>
>                 Key: CASSANDRA-13687
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13687
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Stanislav Vishnevskiy
>         Attachments: 3.0.14.png, 3.0.9.png
>
>
> We recently upgraded from 3.0.9 to 3.0.14 to get the fix from CASSANDRA-13004.
> Sadly, 3 out of the last 7 nights we have had to wake up due to Cassandra 
> dying on us. We currently don't have any data to help reproduce this, but 
> maybe, since there aren't many commits between the 2 versions, it might be 
> obvious.
> Basically, we trigger a parallel incremental repair from a single node every 
> night at 1AM. That node will sometimes start allocating heavily, keeping the 
> heap maxed out and triggering GC. Some of these GC pauses can last up to 2 
> minutes. This effectively destroys the whole cluster due to timeouts to this 
> node.
> The only solution we currently have is to drain the node and restart the 
> repair; it has worked fine on the second attempt every time.
> I attached heap charts from 3.0.9 and 3.0.14 during repair.


