Delaney Manders created CASSANDRA-4225:
------------------------------------------

             Summary: EC2 nodes randomly hard-crash the machine on newest EC2 
Linux AMI
                 Key: CASSANDRA-4225
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.1.0
         Environment: Amazon Linux AMI release 2012.03
3.2.12-3.2.4.amzn1.x86_64
m1.xlarge

Nodes have:
Cassandra built and installed from source.
Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64), 
libtool(2.2.10) installed from AWS repository.
Sun Java:

> java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

Only system changes are:
echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
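To confirm that both entries actually landed, something like the following could be run on each node. This is only a sketch: check_memlock is a hypothetical helper name, and it assumes the entries were appended exactly as in the two echo commands above.

```shell
# Sketch: confirm the root memlock entries exist in a limits file.
# check_memlock is a hypothetical helper; the optional argument is the
# file to inspect and defaults to the path used by the echo commands.
check_memlock() {
    limits_file="${1:-/etc/security/limits.conf}"
    for kind in soft hard; do
        if grep -Eq "^root[[:space:]]+${kind}[[:space:]]+memlock[[:space:]]+unlimited" "$limits_file"; then
            echo "${kind}: ok"
        else
            echo "${kind}: missing"
        fi
    done
}
```

Calling check_memlock with no argument inspects /etc/security/limits.conf. The limits only apply to sessions started after the change, so the file check alone does not prove the running Cassandra process picked them up.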

Setup scripts available.

Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having 
4, DC2 being reserved for Hadoop jobs.  DC2 nodes have not had the same 
frequency of hard crashes, though it has happened.

Storage is set up with 4 ephemeral drives raided for commit, 4 EBS drives 
raided for storage.

Usage is exclusively write, with all mutations being done in batch mutations, 
where each batch mutation has a set of columns added/modified to a single key.  
There are ~2000 threads streaming batch mutations from a web edge of varying 
size, distributed across DC1.  Client is Hector(1.0-5) w/ DynamicLoadBalancing.

In an effort to mitigate this issue, I've removed jna.jar & platform.jar from 
$CASSANDRA_HOME/lib, and set disk_access_mode: standard in 
$CASSANDRA_HOME/conf/cassandra.yaml.  Neither has seemed to help.
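A quick way to double-check that both mitigations are really in effect on a node might look like this. It is a sketch only: check_mitigation is a hypothetical helper, and it assumes the config file lives at conf/cassandra.yaml under the Cassandra home directory.

```shell
# Sketch: report the disk_access_mode line and whether the JNA jars remain.
# check_mitigation is a hypothetical helper; its argument is the Cassandra
# home directory (assumed layout: conf/cassandra.yaml and lib/).
check_mitigation() {
    home="$1"
    # Print the configured line, or note that none is set.
    grep -E '^[[:space:]]*disk_access_mode:' "$home/conf/cassandra.yaml" \
        || echo "disk_access_mode: not set"
    for jar in jna.jar platform.jar; do
        if [ -e "$home/lib/$jar" ]; then
            echo "$jar: still present"
        else
            echo "$jar: removed"
        fi
    done
}
```

It would be invoked as check_mitigation "$CASSANDRA_HOME" on each node.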
            Reporter: Delaney Manders


At fairly random intervals, about once/day, one of my Cassandra nodes does a 
hard crash (kernel panic).  
  
I can find no system logs (/var/log/*) which have any errors.  No cassandra 
logs have any errors.  
  
On one machine I was watching as it went down, and caught the following 
console message:  
> Message from syslogd@domU-12-31-38-00-64-31 at May  3 18:24:17 ...
>  kernel:[252906.019808] Oops: 0002 [#1] SMP

An AWS support guy found one entry in the console logs:
> [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
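For anyone wanting to pull the same kind of lines out of saved console output or syslog files, a filter along these lines may help. It is a sketch: scan_oops is a hypothetical helper, and the patterns are based only on the two messages quoted above, so other panic formats would be missed.

```shell
# Sketch: print kernel Oops lines and "Pid: ..., comm: ..." lines from the
# given files. scan_oops is a hypothetical helper; the patterns are modelled
# on the two messages quoted in this report and nothing else.
scan_oops() {
    grep -E 'Oops: [0-9]+ \[#[0-9]+\]|Pid: [0-9]+, comm: ' "$@"
}
```

Usage would be scan_oops followed by one or more file names, for example a file holding the instance's console output. grep exits non-zero when nothing matches, which is convenient for alerting scripts.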

I've replaced two of the nodes with new instances, but all are showing the same 
behaviour.

It's very reproducible on my system, though it takes a little waiting.  
Leaving it running for another day or so is no big deal; I just need to restart 
Cassandra every once in a while when I get alerted.  

I'm open to any additional requested debugging steps before bailing and going 
back to 1.0.9.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
