[ https://issues.apache.org/jira/browse/CASSANDRA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270606#comment-13270606 ]
Delaney Manders commented on CASSANDRA-4225:
--------------------------------------------
Deploying that change now; I'll report back w/ the next crash. Thanks, Brandon. :)
> EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
> -----------------------------------------------------------------
>
> Key: CASSANDRA-4225
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.1.0
> Environment: Amazon Linux AMI release 2012.03
> 3.2.12-3.2.4.amzn1.x86_64
> m1.xlarge
> Nodes have:
> Cassandra built and installed from source.
> Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64),
> libtool(2.2.10) installed from AWS repository.
> Sun Java:
> > java -version
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> Only system changes are:
> echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
> echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
> Setup scripts available.
> Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having
> 4, DC2 being reserved for Hadoop jobs. DC2 nodes have not had the same
> frequency of hard crashes, though it has happened.
> Storage is set up with 4 ephemeral drives in RAID for the commit log and 4 EBS
> drives in RAID for data storage.
> Usage is exclusively write, with all mutations being done in batch mutations,
> where each batch mutation has a set of columns added/modified to a single
> key. There are ~2000 threads streaming batch mutations from a web edge of
> varying size, distributed across DC1. Client is Hector(1.0-5) w/
> DynamicLoadBalancing.
> In an effort to mitigate this issue, I've removed jna.jar & platform.jar from
> $CASSANDRA_HOME/lib, and set disk_access_mode: standard in
> $CASSANDRA_HOME/conf/cassandra.yaml. Neither has seemed to help.
> Reporter: Delaney Manders
>
> At fairly random intervals, about once/day, one of my Cassandra nodes does a
> hard crash (kernel panic).
>
> I can find no errors in the system logs (/var/log/*) or in the Cassandra
> logs.
>
> I was watching one machine as it went down, and caught the following
> message:
> > Message from syslogd@domU-12-31-38-00-64-31 at May 3 18:24:17 ...
> > kernel:[252906.019808] Oops: 0002 [#1] SMP
> An AWS support guy found one entry in the console logs:
> > [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 #1
> I've replaced two of the nodes with new instances, but all are showing the
> same behaviour.
> It's very reproducible on my system, though it takes a little waiting.
> Leaving it running for another day or so is no big deal; I just need to
> restart Cassandra every once in a while when I get alerted.
> I'm open to any additional requested debugging steps before bailing and going
> back to 1.0.9.