[ 
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394368#comment-14394368
 ] 

Sam Tunnicliffe commented on CASSANDRA-9092:
--------------------------------------------

What consistency level are you writing at? 
How are your clients performing the writes, thrift or native protocol?
How do your clients balance requests? Are they simply sending them round robin 
or using token aware routing? Are you writing in only one DC or to both?
Are there errors or warnings in the logs of the nodes which don't fail? 

Also, I don't think the schema you posted is complete as the primary key 
includes a {{chunk}} column not in the table definition.

If this is not your regular workload (i.e. it's a periodic bulk load) and you 
expect the normal usage pattern to be different, disabling hinted handoff 
temporarily may be a reasonable workaround for you, provided you aren't relying 
on CL.ANY and your clients handle {{UnavailableException}} sanely. You'll also 
need to run repair after the load completes, since writes missed by unavailable 
replicas will no longer be replayed from hints. 
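
As a rough sketch of that sequence (not a tested procedure), something like 
the following could wrap a bulk load. It assumes {{nodetool}} is on the PATH 
and can reach each node via {{-h}}; the host list and the {{load}} callable 
are hypothetical placeholders for your own environment, and running 
{{repair -pr}} on every node in turn is just one possible way to do the repair.

```python
import subprocess


def load_without_hints(hosts, load, run=subprocess.check_call):
    """Disable hinted handoff on every host, run the load, re-enable, repair.

    hosts: node addresses (placeholders, supply your own)
    load:  hypothetical callable that performs the bulk write
    run:   command runner, injectable so the sequence can be dry-run
    """
    for host in hosts:
        # Stop this node from storing new hints for unreachable replicas.
        run(["nodetool", "-h", host, "disablehandoff"])
    try:
        load()
    finally:
        # Re-enable hints even if the load fails part way through.
        for host in hosts:
            run(["nodetool", "-h", host, "enablehandoff"])
    # Only reached if the load succeeded. With no hints written during the
    # load, repair is what brings lagging replicas up to date.
    for host in hosts:
        run(["nodetool", "-h", host, "repair", "-pr"])
```

Passing {{run=print}} shows the commands that would be issued without 
executing anything.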
If that isn't an option, bumping the delivery threads 
({{max_hints_delivery_threads}}) and opening the throttle 
({{hinted_handoff_throttle_in_kb}}) might prevent a huge hints buildup if you 
have sufficient bandwidth and CPU. I doubt it will help much, though: the nodes 
or network are clearly already overwhelmed, otherwise there wouldn't be so many 
hints being written in the first place. 
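
To put a rough number on the network side, here is a back-of-envelope 
calculation using the figures from the description (~3.5MB blobs, 1GbE between 
the DCs). The blob count of one million and the number of copies crossing the 
link are assumptions for illustration only, and protocol overhead, compression 
and all other traffic are ignored.

```python
def max_blobs_per_second(blob_mb=3.5, link_mbit_per_s=1000.0, copies=1):
    """Upper bound on blobs/s the inter-DC link can carry at full line rate."""
    return link_mbit_per_s / (blob_mb * 8 * copies)


def transfer_hours(blob_count, blob_mb=3.5, link_mbit_per_s=1000.0, copies=1):
    """Hours needed to push blob_count blobs across the link at line rate."""
    total_megabits = blob_count * blob_mb * 8 * copies
    return total_megabits / link_mbit_per_s / 3600.0


if __name__ == "__main__":
    # blob_count is an assumed figure; the report only says "several million".
    print(round(max_blobs_per_second(), 1))      # 35.7
    print(round(transfer_hours(1_000_000), 1))   # 7.8
```

So even with a single copy per write crossing the link and nothing else on 
the wire, the ceiling is around 35 blobs per second, or close to 8 hours per 
million blobs. A load that writes faster than that cannot be kept up with 
across DCs, regardless of how the hint settings are tuned.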

> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9092
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: CentOS 6.2 64-bit, Cassandra 2.1.2, 
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
>            Reporter: Sergey Maznichenko
>            Assignee: Sam Tunnicliffe
>             Fix For: 2.1.5
>
>         Attachments: cassandra_crash1.txt
>
>
> Hello,
> We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32GB RAM.
> During a significant workload (loading several million blobs of ~3.5MB each), 
> 1 node in DC2 stops and, after some time, 2 more nodes in DC2 also stop.
> Now 2 of the nodes in DC2 do not work: they stop 5-10 minutes after starting. 
> I see many files in the system.hints table, and the error appears 2-3 minutes 
> after system.hints auto compaction starts.
> By "stops" I mean: "ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456 
> CassandraDaemon.java:153 - Exception in thread 
> Thread[CompactionExecutor:1,1,main]
> java.lang.OutOfMemoryError: Java heap space"
> ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 - 
> Exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.lang.OutOfMemoryError: Java heap space
> The full error listing is attached in cassandra_crash1.txt
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
