[ https://issues.apache.org/jira/browse/CASSANDRA-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394571#comment-14394571 ]
Sam Tunnicliffe commented on CASSANDRA-9092:
--------------------------------------------
Really, I think the answer is likely to be that your cluster is underpowered
for this particular workload, and the build-up of hints is a symptom of that.
Setting {{hinted_handoff_enabled: false}} during the load will obviously stop
that build-up, but you're still going to see failures if the nodes can't keep
up with the workload.
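If you do try that, here's a minimal sketch of both ways to toggle it
(assuming you can run {{nodetool}} against each node; the yaml change only
takes effect on restart, the nodetool one is immediate):
{code}
# Disable hint creation at runtime on each node (no restart required)
nodetool disablehandoff

# Re-enable once the bulk load has finished
nodetool enablehandoff

# Alternatively, in cassandra.yaml (applied on the next restart)
hinted_handoff_enabled: false
{code}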
One thing that puzzles me is that you say you only write to nodes in DC1, but
you're seeing the hints build up in DC2. Hints are only written on the
coordinator, so I suspect that somehow writes are being sent to all the nodes.
Do you see hints being written on the DC1 nodes too?
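One quick way to check, assuming the default data directory and that
{{cqlsh}} can reach each node (adjust paths to your install):
{code}
# Run on every node in both DCs to see where hints are accumulating
du -sh /var/lib/cassandra/data/system/hints-*

# Or count the stored hints directly (can be slow if the table is large)
cqlsh -e "SELECT count(*) FROM system.hints;"
{code}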
Regarding hints storage, the plan is to stop writing them to a system table
and instead use a flat log file. Obviously, this will remove a lot of overhead
(no compaction required, for one), so it will be much more efficient (see
CASSANDRA-6230). As for 2.1, as far as I'm aware there's nothing planned at
the moment, and any large or invasive changes are unlikely to make it into 2.1
this late in the lifetime of the release.
Finally, I notice that you have authentication enabled, as the second timeout
occurs while C* is verifying the supplied credentials. That particular
stacktrace indicates a Thrift connection, whereas the first one is from a
native CQL client. So I have two questions related to that:
* Do you have multiple clients connecting (possibly management tools such as
the OpsCenter agent)?
* Is that auth-related error repeated frequently, or do you mostly see the
netty connection errors?
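If it helps to quantify that, here's a rough sketch (assuming the default log
location; the search strings are placeholders, so substitute distinctive lines
from the two stacktraces in cassandra_crash1.txt):
{code}
# Count occurrences of each error in the node's log
grep -c "<auth timeout message>" /var/log/cassandra/system.log
grep -c "<netty connection error message>" /var/log/cassandra/system.log
{code}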
> Nodes in DC2 die during and after huge write workload
> -----------------------------------------------------
>
> Key: CASSANDRA-9092
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9092
> Project: Cassandra
> Issue Type: Bug
> Environment: CentOS 6.2 64-bit, Cassandra 2.1.2,
> java version "1.7.0_71"
> Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
> Reporter: Sergey Maznichenko
> Assignee: Sam Tunnicliffe
> Fix For: 2.1.5
>
> Attachments: cassandra_crash1.txt
>
>
> Hello,
> We have Cassandra 2.1.2 with 8 nodes, 4 in DC1 and 4 in DC2.
> Each node is a VM with 8 CPUs and 32GB RAM.
> During a significant workload (loading several million blobs of ~3.5MB
> each), one node in DC2 stops, and after some time the next two nodes in DC2
> also stop.
> Now, two of the nodes in DC2 do not work and stop 5-10 minutes after starting.
> I see many files in the system.hints table, and the error appears 2-3
> minutes after the system.hints auto-compaction starts.
> "Stops" means:
> ERROR [CompactionExecutor:1] 2015-04-01 23:33:44,456
> CassandraDaemon.java:153 - Exception in thread
> Thread[CompactionExecutor:1,1,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [HintedHandoff:1] 2015-04-01 23:33:44,456 CassandraDaemon.java:153 -
> Exception in thread Thread[HintedHandoff:1,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.OutOfMemoryError: Java heap space
> The full error listing is attached in cassandra_crash1.txt.
> The problem exists only in DC2. We have 1GbE between DC1 and DC2.