Delaney Manders created CASSANDRA-4225:
------------------------------------------
Summary: EC2 nodes randomly hard-crash the machine on newest EC2
Linux AMI
Key: CASSANDRA-4225
URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 1.1.0
Environment: Amazon Linux AMI release 2012.03
3.2.12-3.2.4.amzn1.x86_64
m1.xlarge
Nodes have:
Cassandra built and installed from source.
Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64),
libtool(2.2.10) installed from AWS repository.
Sun Java:
> java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
Only system changes are:
echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
Setup scripts available.
Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having
4, DC2 being reserved for Hadoop jobs. DC2 nodes have hard-crashed less
frequently than DC1 nodes, though it has happened there too.
Storage is set up with 4 ephemeral drives raided for commit, 4 EBS drives
raided for storage.
Usage is exclusively write, with all mutations being done in batch mutations,
where each batch mutation has a set of columns added/modified to a single key.
There are ~2000 threads streaming batch mutations from a web edge of varying
size, distributed across DC1. Client is Hector(1.0-5) w/ DynamicLoadBalancing.
In an effort to mitigate this issue, I've removed jna.jar & platform.jar from
$CASSANDRA_HOME/lib, and set disk_access_mode: standard in
$CASSANDRA_HOME/conf/cassandra.yaml. Neither has seemed to help.
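For anyone reproducing this setup, here is a minimal sketch of how the two mitigations could be double-checked on a node. The helper names disk_access_mode and jna_removed are hypothetical, and the sketch assumes the setting appears uncommented at the start of a line in cassandra.yaml.

```shell
#!/bin/sh
# Hypothetical helper: print the disk_access_mode value found in a cassandra.yaml.
# Prints nothing if the key is absent or commented out.
# $1: path to cassandra.yaml
disk_access_mode() {
    sed -n 's/^disk_access_mode:[[:space:]]*\([A-Za-z_]*\).*$/\1/p' "$1"
}

# Hypothetical helper: succeed only if no jna*.jar remains in the lib directory.
# $1: path to the Cassandra lib directory
jna_removed() {
    ! ls "$1"/jna*.jar >/dev/null 2>&1
}
```

On a node with both changes applied, disk_access_mode "$CASSANDRA_HOME/conf/cassandra.yaml" would be expected to print standard, and jna_removed "$CASSANDRA_HOME/lib" would be expected to succeed.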
Reporter: Delaney Manders
At fairly random intervals, about once/day, one of my Cassandra nodes does a
hard crash (kernel panic).
I can find no system logs (/var/log/*) which have any errors. No cassandra
logs have any errors.
On one machine I was watching as it went down, and caught the following
message on the terminal:
> Message from syslogd@domU-12-31-38-00-64-31 at May 3 18:24:17 ...
> kernel:[252906.019808] Oops: 0002 [#1] SMP
An AWS support guy found one entry in the console logs:
> [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
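The two kernel lines quoted above are only fragments of what a kernel oops normally prints. As a rough sketch for gathering more of it, the function below pulls oops-related lines out of a saved copy of the console output. The name oops_lines and the pattern list are assumptions for illustration, not an exhaustive set of kernel message prefixes, and the input is assumed to be a plain-text file.

```shell
#!/bin/sh
# Hypothetical helper: print oops/panic-related lines from a saved console log.
# $1: path to a text file holding console output (e.g. a saved EC2 console log)
oops_lines() {
    grep -E 'Oops:|Kernel panic|BUG:|Pid: [0-9]+, comm:' "$1"
}
```

Running oops_lines against console output saved shortly after a crash should surface lines like the two above; the lines that follow them in the log are the ones worth attaching to this ticket.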
I've replaced two of the nodes with new instances, but all are showing the same
behaviour.
It's very reproducible on my system, though it takes a little waiting.
Leaving it running for another day or so is no big deal; I just need to restart
Cassandra every once in a while when I get alerted.
I'm open to any additional requested debugging steps before bailing and going
back to 1.0.9.