[jira] [Commented] (CASSANDRA-14840) Bootstrap of new node fails with OOM in a large cluster

Jeff Jirsa (JIRA) Tue, 23 Oct 2018 22:30:32 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661748#comment-16661748
 ]


Jeff Jirsa commented on CASSANDRA-14840:
----------------------------------------

This is a duplicate of CASSANDRA-11748 and/or CASSANDRA-13569. What's 
happening is that when the new instance comes online, it pulls schema from all 
of the other instances in the cluster at once, so it receives 80+ copies of 
what's probably a very large schema simultaneously. 
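
To make the scale concrete, here is a rough back-of-envelope sketch. The 
per-copy schema size is a made-up assumption for illustration, not a 
measurement from this cluster, and the deserialized mutations will likely 
occupy more heap than their serialized size suggests.

```python
def schema_pull_bytes(peers: int, schema_bytes: int) -> int:
    """Upper bound on schema payload in flight if every peer answers at once."""
    return peers * schema_bytes


peers = 80                        # cluster size from the report
schema_bytes = 20 * 1024 * 1024   # ASSUMPTION: ~20 MiB serialized schema for ~3000 tables

total = schema_pull_bytes(peers, schema_bytes)
print(f"{total / 1024 ** 3:.2f} GiB of schema arriving at roughly the same time")
```

Even with a modest per-copy size, the total grows linearly with the number of 
peers, which would fit the failure only appearing after 75-80 nodes.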

If you really have no data in any of those tables, the easiest solution may be 
to start removing them to decrease schema size and make the thundering herd of 
schema mutations less painful (this may be a viable option if the tables are 
old and unused - if you expect to use them again, keep reading).

Beyond that, you have two options:
1) Try to make it so you can better handle all of the incoming mutations - this 
may mean a bigger heap, tuning the memtable, or similar. It's hard to give 
concrete suggestions without a heap dump and your current settings. Off-heap 
memtables (memtable_allocation_type in cassandra.yaml) may be a starting point 
given you're on 2.1.
2) Try to limit the number of concurrent migrations - this is going to sound 
awful, for obvious reasons, but one thing that may work is to artificially 
restrict your instance's view of the ring using firewall rules so it can only 
communicate with a handful of hosts (maybe just the seeds) for the first 5-15 
seconds after it starts. Once it has the schema, remove the rules so it can 
talk to the rest of the cluster and bootstrap properly.
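
A minimal sketch of option 2, assuming iptables on the new node and the 
default storage_port of 7000. The helper name firewall_commands and the seed 
addresses are hypothetical. The script only prints the commands and does not 
run them; review them before applying anything to a real host.

```python
from typing import List

STORAGE_PORT = 7000  # Cassandra's default storage_port; change if yours differs


def firewall_commands(seeds: List[str], action: str = "-A",
                      port: int = STORAGE_PORT) -> List[str]:
    """Build iptables commands that limit inter-node traffic to `seeds`.

    action="-A" appends the rules; action="-D" deletes the same rules later.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    cmds = []
    for ip in seeds:
        # Allow storage-port traffic to and from each seed.
        cmds.append(f"iptables {action} INPUT -p tcp --dport {port} -s {ip} -j ACCEPT")
        cmds.append(f"iptables {action} OUTPUT -p tcp --dport {port} -d {ip} -j ACCEPT")
    # Drop storage-port traffic to and from every other host.
    cmds.append(f"iptables {action} INPUT -p tcp --dport {port} -j DROP")
    cmds.append(f"iptables {action} OUTPUT -p tcp --dport {port} -j DROP")
    return cmds


seeds = ["10.0.0.1", "10.0.0.2"]  # ASSUMPTION: placeholder seed addresses

print("# before starting Cassandra on the new node:")
print("\n".join(firewall_commands(seeds, "-A")))
print("# once the node has the schema:")
print("\n".join(firewall_commands(seeds, "-D")))
```

The ACCEPT rules are emitted before the DROP rules because iptables evaluates 
a chain in order; deleting with the same rule specifications restores the 
node's full view of the ring.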

One of the other two JIRAs will eventually get addressed; I'm going to dupe 
this to CASSANDRA-11748 since it's a lower number (earlier reporting). 

> Bootstrap of new node fails with OOM in a large cluster
> -------------------------------------------------------
>
>                 Key: CASSANDRA-14840
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14840
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Jai Bheemsen Rao Dhanwada
>            Priority: Critical
>
> We are seeing new node addition fail with OOM during bootstrap in a cluster 
> of more than 80 nodes and 3000 CFs, without any data in those CFs.
>  
> Steps to reproduce:
>  # Launch a 3 node cluster
>  # Create 3000 CF in the cluster
>  # Start adding nodes to the cluster one by one
>  # After adding 75-80 nodes, the new node bootstrap fails with OOM.
> {code:java}
> ERROR [PERIODIC-COMMIT-LOG-SYNCER] 2018-10-24 03:26:47,870 JVMStabilityInspector.java:78 - Exiting due to error while processing commit log during initialization.
> java.lang.OutOfMemoryError: Java heap space
>  at java.util.regex.Pattern.matcher(Pattern.java:1093) ~[na:1.8.0_151]
>  at java.util.Formatter.parse(Formatter.java:2547) ~[na:1.8.0_151]
>  at java.util.Formatter.format(Formatter.java:2501) ~[na:1.8.0_151]
>  at java.util.Formatter.format(Formatter.java:2455) ~[na:1.8.0_151]
>  at java.lang.String.format(String.java:2940) ~[na:1.8.0_151]
>  at org.apache.cassandra.db.commitlog.AbstractCommitLogService$1.run(AbstractCommitLogService.java:105) ~[apache-cassandra-2.1.16.jar:2.1.16]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_151]{code}
> Cassandra Version: 2.1.16
> OS: CentOS7
> num_tokens: 256 on each node.
>  
> This behavior is blocking us from adding extra capacity when needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
