Hi All -

I am trying to bootstrap a replacement node in a cluster, but it consistently 
fails to bootstrap because of OOM exceptions. For almost a week I've been going 
through cycles of bootstrapping, finding errors, then restarting / resuming 
bootstrap, and I am struggling to move forward. Sometimes the bootstrapping 
node itself fails, which usually manifests first as very high GC times 
(sometimes 30s+!), then nodetool commands start to fail with timeouts, then the 
node will crash with an OOM exception. Other times, a node streaming data to 
this bootstrapping node will have a similar failure. In either case, when it 
happens I need to restart the crashed node, then resume the bootstrap.

On top of these issues, when I do need to restart a node it takes a very long time
(http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start).
This exacerbates the problem because it takes so long to find out if a change 
to the cluster helps or if it still fails. I am in the process of upgrading all 
nodes in the cluster from m4.xlarge to c4.4xlarge, and I am running Cassandra 
DDC 3.5 on all nodes. The cluster has 26 nodes spread across 4 regions in EC2. 
Here is some other relevant cluster info (also in the Stack Overflow post):

Cluster Info

  *   Cassandra DDC 3.5
  *   EC2MultiRegionSnitch
  *   m4.xlarge, moving to c4.4xlarge

Schema Info

  *   3 CFs, all 'write once' (i.e. no updates), 1 week TTL, STCS (default)
  *   no secondary indexes

I am unsure what to try next. The node that is currently having this bootstrap 
problem is a pretty beefy box, with 16 cores, 30 GB of RAM, and a 3.2 TB EBS 
volume. The slow startup time might be because of the issues with a high number 
of SSTables that Jeff Jirsa mentioned in a comment on the SO post, but I am at 
a loss for the OOM issues. I've tried:

  *   Changing from CMS to G1 GC, which seems to have helped a bit
  *   Upgrading from 3.5 to 3.9, which did not seem to help
  *   Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to 
help, but I'm still having issues
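
To put a number on the SSTable count mentioned above, a rough sketch like the following could tally the SSTable data files per table on a node. It assumes the usual on-disk layout of <data dir>/<keyspace>/<table>-<id>/ with one *-Data.db file per SSTable; count_sstables is a hypothetical helper name, and /var/lib/cassandra/data is only an assumed default path that should be replaced with the node's actual data directory.

```python
import os
import sys
from collections import Counter


def count_sstables(data_dir):
    """Count *-Data.db files per keyspace/table directory under data_dir."""
    counts = Counter()
    for root, dirs, files in os.walk(data_dir):
        # Skip snapshot and backup copies so only live SSTables are counted.
        dirs[:] = [d for d in dirs if d not in ("snapshots", "backups")]
        n = sum(1 for name in files if name.endswith("-Data.db"))
        if n:
            counts[os.path.relpath(root, data_dir)] += n
    return counts


if __name__ == "__main__":
    # Assumed default location; pass the real data directory as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/cassandra/data"
    for table, n in count_sstables(path).most_common():
        print(f"{n:8d}  {table}")
```

Running it before and after a restart, or periodically during a bootstrap, would show whether any one table is accumulating an unusually large number of SSTables.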

I'd appreciate any suggestions on what else I can try to track down the cause 
of these OOM exceptions.

- Mike
