[ https://issues.apache.org/jira/browse/CASSANDRA-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600078#comment-14600078 ]

Constance Eustace edited comment on CASSANDRA-9640 at 6/24/15 8:52 PM:
-----------------------------------------------------------------------

Attached syslog.zip.

The destabilization occurs around the end of _0040 and continues into the next 
logfile.

I suspect that multiple huge/wide partition keys are being resolved in 
parallel, and that this may be filling the heap, since a smaller set of large 
wide rows (1-10 rows) doesn't seem to bother it.

I'd guess we had 20-40 rows of 5-10 GB each when this went teetering down.

entity_etljob is the processing table that has the ultra-huge rows.
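As a rough way to confirm the partition sizes involved, something like the 
following could be checked on each node (sketch only; "myks" is a placeholder 
for the actual keyspace name):

    nodetool cfstats myks.entity_etljob       # look at "Compacted partition maximum bytes"
    nodetool cfhistograms myks entity_etljob  # partition size / cell count percentiles

If the compacted partition maximum comes back in the multi-GB range, that would 
line up with the 20-40 rows of 5-10 GB each described above.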



was (Author: cowardlydragon):
Attached syslog.zip.

The destabilization occurs around the end of _0040 and continues into the next 
logfile.

I suspect that multiple huge/wide partition keys are being resolved in 
parallel, and that this may be filling the heap, since a smaller set of large 
wide rows (1-10 rows) doesn't seem to bother it.

I'd guess we had 20-30 days of rows...

entity_etljob is the processing table that has the ultra-huge rows.


> Nodetool repair of very wide, large rows causes GC pressure and 
> destabilization
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9640
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9640
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: AWS, ~8GB heap
>            Reporter: Constance Eustace
>            Assignee: Yuki Morishita
>            Priority: Minor
>             Fix For: 2.1.x
>
>         Attachments: syslog.zip
>
>
> We've noticed our nodes becoming unstable with large, unrecoverable Old Gen 
> GCs until OOM.
> This appears to be around the time of repair, and the specific cause seems to 
> be one of our report computation tables that involves possibly very wide rows 
> with 10GB of data in them. This is an RF 3 table in a four-node cluster.
> We truncate this occasionally, and we also had disabled this computation 
> report for a bit and noticed better node stability.
> I wish I had more specifics. We are switching to an RF 1 table and doing more 
> proactive truncation of the table. 
> When things calm down, we will attempt to replicate the issue and watch GC 
> and other logs.
> Any suggestion for things to look for/enable tracing on would be welcome.


