[jira] [Commented] (CASSANDRA-4225) EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI

Delaney Manders (JIRA) Tue, 08 May 2012 09:56:11 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13270606#comment-13270606 ]


Delaney Manders commented on CASSANDRA-4225:
--------------------------------------------

Deploying that change now; I'll report back w/ the next crash.  Thanks, Brandon. :)
                
> EC2 nodes randomly hard-crash the machine on newest EC2 Linux AMI
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-4225
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4225
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.0
>         Environment: Amazon Linux AMI release 2012.03
> 3.2.12-3.2.4.amzn1.x86_64
> m1.xlarge
> Nodes have:
> Cassandra built and installed from source.
> Ant binary (apache-ant-1.8.3-bin.tar.gz), automake(1.11.1), autoconf(2.64), 
> libtool(2.2.10) installed from AWS repository.
> Sun Java:
> > java -version
> java version "1.6.0_31"
> Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
> Only system changes are:
> echo "root soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
> echo "root hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
> Setup scripts available.
> Cassandra cluster has two datacenters, with DC1 having 8 nodes and DC2 having 
> 4, DC2 being reserved for Hadoop jobs.  DC2 nodes have not had the same 
> frequency of hard crashes, though it has happened.
> Storage is set up with 4 ephemeral drives RAIDed for the commit log, 4 EBS 
> drives RAIDed for storage.
> Usage is exclusively write, with all mutations being done in batch mutations, 
> where each batch mutation has a set of columns added/modified to a single 
> key.  There are ~2000 threads streaming batch mutations from a web edge of 
> varying size, distributed across DC1.  Client is Hector(1.0-5) w/ 
> DynamicLoadBalancing.
> In an effort to mitigate this issue, I've removed jna.jar & platform.jar from 
> $CASSANDRA_HOME/lib, and set disk_access_mode: standard in 
> $CASSANDRA_HOME/conf/cassandra.yaml.  Neither has seemed to help.
>            Reporter: Delaney Manders
>
> At fairly random intervals, about once/day, one of my Cassandra nodes does a 
> hard crash (kernel panic).  
>   
> I can find no system logs (/var/log/*) which have any errors.  No cassandra 
> logs have any errors.  
>   
> On one machine I was watching as it went down, and caught the following 
> message:  
> > Message from syslogd@domU-12-31-38-00-64-31 at May  3 18:24:17 ...
> >  kernel:[252906.019808] Oops: 0002 [#1] SMP
> An AWS support guy found one entry in the console logs:
> > [30178.298308] Pid: 2238, comm: java Not tainted 3.2.12-3.2.4.amzn1.x86_64 
> > #1
> I've replaced two of the nodes with new instances, but all are showing the 
> same behaviour.
> It's very reproducible on my system, though it takes a little waiting.  
> Leaving it running is no big deal for another day or so; I just need to 
> restart Cassandra every once in a while when I get alerted.  
> I'm open to any additional requested debugging steps before bailing and going 
> back to 1.0.9.
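
A note for anyone triaging the Oops line quoted above: the four hex digits after "Oops:" are normally the x86 page-fault error code. The sketch below decodes that bitmask in Python, assuming the Oops came from the page-fault path. The function name decode_oops and the regular expression are illustrative only and are not part of any kernel or Cassandra tooling.

```python
import re

# Matches the "Oops: 0002 [#1]" portion of a kernel log line.
OOPS_RE = re.compile(r"Oops: ([0-9a-fA-F]{4}) \[#(\d+)\]")


def decode_oops(line):
    """Decode the error code of an x86 Oops line, or return None if absent.

    Assumption: the code is a page-fault error code, so its low bits
    follow the x86 page-fault error-code layout.
    """
    match = OOPS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    return {
        "code": code,
        "page_present": bool(code & 0x01),       # 0 = page not present
        "write": bool(code & 0x02),              # 1 = write access
        "user_mode": bool(code & 0x04),          # 0 = fault in kernel mode
        "reserved_bit": bool(code & 0x08),       # reserved bit set in page table
        "instruction_fetch": bool(code & 0x10),  # fault on instruction fetch
        "oops_count": int(match.group(2)),       # the number after '#'
    }


if __name__ == "__main__":
    sample = "kernel:[252906.019808] Oops: 0002 [#1] SMP"
    for key, value in decode_oops(sample).items():
        print(key, value)
```

Under that assumption, 0002 reads as a kernel-mode write to a page that was not present, which is consistent with the fault occurring inside the kernel rather than in the JVM's user-space code.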

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
