[ 
https://issues.apache.org/jira/browse/CASSANDRA-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610580#comment-14610580
 ] 

Constance Eustace commented on CASSANDRA-9640:
----------------------------------------------

Crap, those were the wrong ones. That was a strange hammering of Old Gen GC that 
occurred for unknown reasons and was followed by an OOM collapse.



Yes, the batches are occurring in our persistence layer and are being addressed. 
As you can see, the batch is a whopping 6K rather than 5K, but that's not the 
source of the GC pressure.
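
For reference, the 5K figure is the batch size warning threshold from 
cassandra.yaml (the value below is the shipped default, as far as I know):

    # log a WARN for any batch whose serialized size exceeds this threshold
    batch_size_warn_threshold_in_kb: 5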

The logs have rolled already, so they are gone. Our old gen was 8GB, and it 
would steadily rise to fill that during what appeared to be the repair of the 
table containing a small number of very wide (some 10GB) rows. That was 
followed by a series of expensive Old Gen G1 GC collections taking 10-30 
seconds each, and then an OOM collapse. These old gen GCs would not recover 
very much of the old gen (1-5%, IIRC), so the node would start to "thrash", 
collecting about once per minute, which meant it was practically unavailable 
50% of the time. 
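
For anyone hitting something similar, the wide partitions should show up in 
nodetool's per-table stats; the keyspace/table names below are made up for 
illustration:

    # "Compacted partition maximum bytes" shows the largest partition on disk
    nodetool cfstats reports.computation_results

    # partition size / cell count percentiles for the same table
    nodetool cfhistograms reports computation_results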

Proactive truncation of those rows seems to have solved it in the practical 
sense. If I get time, I'll attempt an isolated cluster reproduction in a lower 
environment.
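
For the reproduction attempt I'll enable GC logging up front so the old gen 
pattern is captured before the logs roll. A minimal sketch, assuming the stock 
cassandra-env.sh; the log path is just an example:

    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+UseGCLogFileRotation"
    JVM_OPTS="$JVM_OPTS -XX:NumberOfGCLogFiles=10"
    JVM_OPTS="$JVM_OPTS -XX:GCLogFileSize=10M"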



> Nodetool repair of very wide, large rows causes GC pressure and 
> destabilization
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9640
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9640
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: AWS, ~8GB heap
>            Reporter: Constance Eustace
>            Priority: Minor
>             Fix For: 2.1.x
>
>         Attachments: syslog.zip
>
>
> We've noticed our nodes becoming unstable with large, unrecoverable Old Gen 
> GCs until OOM.
> This appears to be around the time of repair, and the specific cause seems to 
> be one of our report computation tables that involves possibly very wide rows 
> with 10GB of data in them. This is an RF 3 table in a four-node cluster.
> We truncate this occasionally, and we had also disabled this computation 
> report for a bit and noticed better node stability.
> I wish I had more specifics. We are switching to an RF 1 table and doing more 
> proactive truncation of the table.
> When things calm down, we will attempt to replicate the issue and watch GC 
> and other logs.
> Any suggestion for things to look for/enable tracing on would be welcome.



