[
https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394368#comment-14394368
]
Sam Tunnicliffe commented on CASSANDRA-9092:
--------------------------------------------
What consistency level are you writing at?
How are your clients performing the writes, thrift or native protocol?
How do your clients balance requests? Are they simply sending them round robin
or using token aware routing? Are you writing in only one DC or to both?
Are there errors or warnings in the logs of the nodes which don't fail?
Also, I don't think the schema you posted is complete as the primary key
includes a {{chunk}} column not in the table definition.
If this is not your regular workload (i.e. it's a periodic bulk load) and you
expect the normal usage pattern to be different, disabling hinted handoff
temporarily may be a reasonable workaround for you, provided you aren't relying
on CL.ANY and your clients handle {{UnavailableException}} sanely. You'll also
need to run repair after the load completes.
If that isn't an option, bumping the delivery threads and opening the throttle
might prevent a huge hints buildup if you have sufficient bandwidth and CPU.
I doubt it will help much, though: the nodes or network are clearly already
overwhelmed, otherwise there wouldn't be so many hints being written in the
first place.
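For reference, the settings involved in both suggestions live in cassandra.yaml.
The fragment below is only a sketch: the option names are the ones shipped in
the 2.1 cassandra.yaml, but the raised values are illustrative assumptions, not
tuned recommendations, and need sizing against your actual bandwidth and CPU
headroom.

```yaml
# Option 1: stop writing hints for the duration of the bulk load.
# Only reasonable if you don't rely on CL.ANY; run repair once the load completes.
hinted_handoff_enabled: false

# Option 2: leave hinted handoff enabled, but deliver hints faster.
# The shipped defaults are 2 threads and 1024 KB/s; the values below are
# illustrative assumptions only.
# max_hints_delivery_threads: 4
# hinted_handoff_throttle_in_kb: 4096
```

Edits to cassandra.yaml take effect on restart. To toggle hinted handoff on a
running node, {{nodetool disablehandoff}} and {{nodetool enablehandoff}} do the
same thing without a restart.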
> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>
> Key: CASSANDRA-9092
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
> Project: Cassandra
> Issue Type: Bug
> Environment: CentOS 6.2 64-bit, Cassandra 2.1.2,
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> Reporter: Sergey Maznichenko
> Assignee: Sam Tunnicliffe
> Fix For: 2.1.5
>
> Attachments: cassandra_crash1.txt
>
>
> Hello,
> We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32GB RAM.
> During a significant workload (loading several million blobs of ~3.5MB each), 1
> node in DC2 stops, and after some time the next 2 nodes in DC2 also stop.
> Now, 2 of the nodes in DC2 do not work and stop 5-10 minutes after start.
> I see many files in the system.hints table, and the error appears 2-3 minutes
> after system.hints auto compaction starts.
> "Stops" means: "ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456
> CassandraDaemon.java:153 - Exception in thread
> Thread[CompactionExecutor:1,1,main]
> java.lang.OutOfMemoryError: Java heap space"
> ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 -
> Exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.OutOfMemoryError: Java heap space
> The full error listing is attached in cassandra_crash1.txt.
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.