Hi All - I am trying to bootstrap a replacement node in a cluster, but it consistently fails to bootstrap because of OOM exceptions. For almost a week I've been going through cycles of bootstrapping, finding errors, then restarting / resuming bootstrap, and I am struggling to move forward. Sometimes the bootstrapping node itself fails, which usually manifests first as very high GC times (sometimes 30s+!), then nodetool commands start to fail with timeouts, then the node will crash with an OOM exception. Other times, a node streaming data to this bootstrapping node will have a similar failure. In either case, when it happens I need to restart the crashed node, then resume the bootstrap.
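To see when the long GC pauses begin relative to a crash, the Cassandra system log could be scanned for GC entries above a threshold. The following is only a minimal sketch: it assumes the GC log lines contain the text `GC in <N>ms`, which is how Cassandra's GCInspector usually reports pauses. The function name, the threshold, and the sample lines are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed message shape: "... <collector name> GC in <N>ms. ..."
GC_PAUSE = re.compile(r"GC in (\d+)ms")


def long_gc_pauses(lines: Iterable[str], threshold_ms: int = 1000) -> List[Tuple[int, str]]:
    """Return (pause_ms, line) for each GC log line at or above threshold_ms."""
    hits = []
    for line in lines:
        match = GC_PAUSE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits


# Made-up sample lines, for illustration only (not taken from a real log).
sample = [
    "INFO  [Service Thread] GCInspector.java - G1 Young Generation GC in 250ms.",
    "WARN  [Service Thread] GCInspector.java - G1 Old Generation GC in 30219ms.",
    "INFO  [main] StorageService.java - Bootstrap completed",
]

for pause_ms, line in long_gc_pauses(sample):
    print(pause_ms, line)
```

Against a real node, the same function could be fed an open file handle for the node's system.log (the location depends on the install) instead of the sample list.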
On top of these issues, when I do need to restart a node it takes a very long time (http://stackoverflow.com/questions/40141739/why-does-cassandra-sometimes-take-a-hours-to-start). This exacerbates the problem because it takes so long to find out whether a change to the cluster helps or whether it still fails.

I am in the process of upgrading all nodes in the cluster from m4.xlarge to c4.4xlarge, and I am running Cassandra DDC 3.5 on all nodes. The cluster has 26 nodes spread across 4 regions in EC2. Here is some other relevant cluster info (also in the Stack Overflow post):

Cluster Info
* Cassandra DDC 3.5
* EC2MultiRegionSnitch
* m4.xlarge, moving to c4.4xlarge

Schema Info
* 3 CFs, all 'write once' (i.e. no updates), 1 week TTL, STCS (default)
* no secondary indexes

I am unsure what to try next. The node that is currently having this bootstrap problem is a pretty beefy box, with 16 cores, 30 GB of RAM, and a 3.2 TB EBS volume. The slow startup time might be caused by the high number of SSTables that Jeff Jirsa mentioned in a comment on the SO post, but I am at a loss on the OOM issues.

I've tried:
* Changing from CMS to G1 GC, which seemed to help a bit
* Upgrading from 3.5 to 3.9, which did not seem to help
* Upgrading instance types from m4.xlarge to c4.4xlarge, which seems to help, but I'm still having issues

I'd appreciate any suggestions on what else I can try to track down the cause of these OOM exceptions.

- Mike