We have a 4-node cluster in AWS that looks like this:
1 x m4.2xlarge - runs all Graylog roles and processes incoming messages
3 x m4.xlarge - run the "backend" roles: Elasticsearch, graylog-server, etcd, MongoDB
All nodes have a 2.4 TB EBS-backed data volume. We store about 4 TB of data
(2.5 billion messages across 1,800 indices, roughly 2 GB each). We use the
provided AMIs and the 1.2.1 omnibus package - more precisely, we started from
the provided 1.1.4 image and upgraded each instance through 1.1.6 and 1.2.0
to 1.2.1.
When we restart even a single Elasticsearch node, after a few minutes we end
up with JVM garbage-collection warnings in the ES log, usually on the ES
master node:
2015-10-02_21:43:28.48251 [2015-10-02 21:43:28,482][WARN ][monitor.jvm
] [Ms. MODOK] [gc][old][2628][41] duration [19.3s], collections
[1]/[20s], total [19.3s]/[2m], memory [9.1gb]->[9.1gb]/[9.3gb], all_pools
{[young] [35.7mb]->[36.5mb]/[266.2mb]}{[survivor]
[0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:55.40431 [2015-10-02 21:43:55,404][WARN ][monitor.jvm
] [Ms. MODOK] [gc][old][2630][42] duration [25.7s], collections
[1]/[25.8s], total [25.7s]/[2.5m], memory [9.3gb]->[9gb]/[9.3gb], all_pools
{[young] [266.2mb]->[21.1mb]/[266.2mb]}{[survivor]
[23.1mb]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
2015-10-02_21:43:56.30787 [2015-10-02 21:43:56,307][WARN ][cluster.service
] [Ms. MODOK] cluster state update task [shard-started
([graylog_3873][0], node[5r9lBQcNQRCxPRwovXtzyg], [R], s[INITIALIZING],
unassigned_info[[reason=NODE_LEFT], at[2015-10-02T21:19:44.518Z],
details[node_left[CZIGj-wJQFqOYP0ZWdWHdg]]]), reason [after recovery
(replica) from node [[Ms.
MODOK][QvZjbJ9kR12F41NZwNygTg][example.com][inet[/x.x.x.x:9300]]]]] took
1.2m above the warn threshold of 30s
2015-10-02_21:44:16.95477 [2015-10-02 21:44:16,954][WARN ][monitor.jvm
] [Ms. MODOK] [gc][old][2632][43] duration [19.6s], collections
[1]/[20.5s], total [19.6s]/[2.8m], memory [9.3gb]->[9.1gb]/[9.3gb],
all_pools {[young] [254.3mb]->[49.7mb]/[266.2mb]}{[survivor]
[0b]->[0b]/[33.2mb]}{[old] [9gb]->[9gb]/[9gb]}
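Those GC lines show the old-generation pool pinned at [9gb]->[9gb]/[9gb], i.e. the heap is effectively full and the JVM is doing back-to-back full collections. A quick way to compare heap pressure across nodes is the node-stats API (a sketch; the sample JSON below stands in for a live response, assuming the HTTP API is reachable on localhost:9200):

```shell
# Live call would be:
#   curl -s 'localhost:9200/_nodes/stats/jvm?pretty'
# Sample fragment standing in for one node's response:
sample='{"nodes":{"QvZjbJ9kR12F41NZwNygTg":{"name":"Ms. MODOK","jvm":{"mem":{"heap_used_percent":97}}}}}'
# Pull out the per-node heap usage percentage:
pct=$(echo "$sample" | grep -o '"heap_used_percent":[0-9]*' | cut -d: -f2)
echo "heap used: ${pct}%"
```

A node that sits in the high 90s between collections, like the GC log above suggests, is effectively out of heap.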
At that point, ES becomes slow to respond to almost every request, even
/_cluster/health.
It's very tricky to get ES back to a fully initialized state. Eventually
unassigned_shards stops decreasing. I usually have to restart ES on the
machine where the JVM warnings are thrown and hope it does better on the
next round.
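To see whether recovery is still making progress, the number of UNASSIGNED shards can be counted from the cat API (a sketch; the sample lines below stand in for a live `curl -s localhost:9200/_cat/shards`):

```shell
# Live call would be:
#   curl -s 'localhost:9200/_cat/shards'
# Sample output standing in for a live response
# (columns: index shard prirep state docs store ip node):
sample='graylog_3873 0 p STARTED    1000 2gb x.x.x.x node1
graylog_3873 0 r UNASSIGNED
graylog_3872 0 p STARTED    1000 2gb x.x.x.x node2'
# Count shards stuck in UNASSIGNED; re-run to see whether the number drops:
n=$(echo "$sample" | awk '$4 == "UNASSIGNED"' | wc -l)
echo "unassigned shards: $n"
```

Re-running this every minute or so makes the stall visible: the count drops for a while, then freezes once the GC pauses start.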
In my elasticsearch.yml I have "bootstrap.mlockall: true", and as far as the
API can tell it is in effect on all nodes.
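For reference, the check is against the nodes-info API (a sketch; the sample fragment below stands in for a live `curl -s localhost:9200/_nodes/process?pretty`):

```shell
# Live call would be:
#   curl -s 'localhost:9200/_nodes/process?pretty' | grep mlockall
# Sample fragment standing in for one node's response:
resp='"process": { "refresh_interval_in_millis": 1000, "mlockall": true }'
out=$(echo "$resp" | grep -o '"mlockall": *[a-z]*')
echo "$out"
```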
Any suggestions out there?
Thanks!
--
You received this message because you are subscribed to the Google Groups
"Graylog Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/graylog2/2d6b906b-4bf5-4493-b7fc-a227edac6146%40googlegroups.com.