We have a 4-node cluster in AWS that looks like this:

1 x m4.2xlarge - runs all Graylog roles and processes incoming messages
3 x m4.xlarge - run the "backend" roles: ES, graylog-server, etcd, mongo

All nodes have a 2.4 TB EBS-backed data volume. We store about 4 TB of data
(2.5 billion messages across 1,800 indices, roughly 2 GB each). We use the
provided AMIs with the 1.2.1 omnibus package - more precisely, we started
from the provided 1.1.4 image and upgraded each instance through 1.1.6 and
1.2.0 to 1.2.1.

When we restart even a single ES node, we start seeing JVM GC warnings in
the ES log after a few minutes, usually on the ES master node:

2015-10-02_21:43:28.48251 [2015-10-02 21:43:28,482][WARN ][monitor.jvm     
         ] [Ms. MODOK] [gc][old][2628][41] duration [19.3s], collections 
[1]/[20s], total [19.3s]/[2m], memory [9.1gb]->[9.1gb]/[9.3gb], all_pools 
{[young] [35.7mb]->[36.5mb]/[266.2mb]}{[survivor] 
[0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:55.40431 [2015-10-02 21:43:55,404][WARN ][monitor.jvm     
         ] [Ms. MODOK] [gc][old][2630][42] duration [25.7s], collections 
[1]/[25.8s], total [25.7s]/[2.5m], memory [9.3gb]->[9gb]/[9.3gb], all_pools 
{[young] [266.2mb]->[21.1mb]/[266.2mb]}{[survivor] 
[23.1mb]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:56.30787 [2015-10-02 21:43:56,307][WARN ][cluster.service 
         ] [Ms. MODOK] cluster state update task [shard-started 
([graylog_3873][0], node[5r9lBQcNQRCxPRwovXtzyg], [R], s[INITIALIZING], 
unassigned_info[[reason=NODE_LEFT], at[2015-10-02T21:19:44.518Z], 
details[node_left[CZIGj-wJQFqOYP0ZWdWHdg]]]), reason [after recovery 
(replica) from node [[Ms. 
MODOK][QvZjbJ9kR12F41NZwNygTg][example.com][inet[/x.x.x.x:9300]]]]] took 
1.2m above the warn threshold of 30s
2015-10-02_21:44:16.95477 [2015-10-02 21:44:16,954][WARN ][monitor.jvm     
         ] [Ms. MODOK] [gc][old][2632][43] duration [19.6s], collections 
[1]/[20.5s], total [19.6s]/[2.8m], memory [9.3gb]->[9.1gb]/[9.3gb], 
all_pools {[young] [254.3mb]->[49.7mb]/[266.2mb]}{[survivor] 
[0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}

At that point, ES becomes slow to respond to almost every request, including
/_cluster/health.
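For reference, this is roughly how I watch recovery progress while this is
happening (assuming ES answers HTTP on localhost:9200; adjust the host if
yours differs):

```shell
# Full cluster health, including unassigned/initializing shard counts
# (assumes ES is listening on localhost:9200)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Compact one-line view of the same numbers, handy for repeated polling
curl -s 'http://localhost:9200/_cat/health?v'
```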

It's very tricky to get ES back to a fully initialized state. Eventually the
unassigned_shards count stops decreasing, and I usually have to restart ES
on the machine throwing the JVM warnings and hope it does better on the next
round.
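For concreteness: my restarts are plain service restarts. I have not yet
tried wrapping them in an allocation disable/enable, which as I understand
the ES 1.x cluster settings API would look roughly like this (a sketch; I'm
not sure it would help here):

```shell
# Before restarting a node: stop the cluster from reallocating its shards
# while the node is down (ES 1.x transient cluster setting)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... restart the node and wait for it to rejoin the cluster ...

# Afterwards: re-enable allocation so recovery can finish
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```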

In my elasticsearch.yml I have "bootstrap.mlockall: true", and the API
reports it as in effect on all nodes.
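Specifically, I verify it like this (assuming the default HTTP port):

```shell
# The nodes info API reports whether mlockall actually took effect
# on each node; every node should show "mlockall" : true
curl -s 'http://localhost:9200/_nodes/process?pretty' | grep mlockall
```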

Any suggestions out there?

Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"Graylog Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/graylog2/2d6b906b-4bf5-4493-b7fc-a227edac6146%40googlegroups.com.