[graylog2] large searches kill ES - can graylog stop this?

Jason Haar Tue, 31 May 2016 17:49:05 -0700

Hi there

I just ran a simple search over 30 days of data and managed to trigger the 
following ES error:


[2016-06-01 00:12:53,525][WARN ][indices.breaker.fielddata] [fielddata] New 
used memory 11273780309 [10.4gb] for data of [message] would be larger than 
configured breaker: 10857952051 [10.1gb], breaking


From what I can find by googling, this means that loading field data for 
the [message] field would have taken more memory than the fielddata circuit 
breaker allows, so ES refused the request. That condition somehow triggers 
an epic fail: either ES becomes unresponsive or graylog-server does - I 
can't tell which. All I know is that right now I have messages going into 
graylog and nothing coming out.
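
To put a number on that breaker message, here is a minimal Python sketch 
that pulls the two byte counts out of the warning and reports the overshoot. 
The regex only assumes the message format exactly as logged above; the names 
BREAKER_RE and parse_breaker_warning are made up for illustration and are 
not part of ES or graylog.

```python
import re

# The breaker warning from the log above, re-joined onto one line.
LOG_LINE = (
    "[2016-06-01 00:12:53,525][WARN ][indices.breaker.fielddata] [fielddata] New "
    "used memory 11273780309 [10.4gb] for data of [message] would be larger than "
    "configured breaker: 10857952051 [10.1gb], breaking"
)

# Hypothetical pattern, written against the message format shown above only.
BREAKER_RE = re.compile(
    r"New used memory (?P<wanted>\d+) \[[^\]]+\] "
    r"for data of \[(?P<field>[^\]]+)\] "
    r"would be larger than configured breaker: (?P<limit>\d+) \["
)


def parse_breaker_warning(line):
    """Return (field, wanted_bytes, limit_bytes), or None if the line does not match."""
    match = BREAKER_RE.search(line)
    if match is None:
        return None
    return match.group("field"), int(match.group("wanted")), int(match.group("limit"))


field, wanted, limit = parse_breaker_warning(LOG_LINE)
overshoot = wanted - limit
print(f"field={field} wanted={wanted} limit={limit} overshoot={overshoot} bytes")
```

For the line above, the request was refused for being roughly 400 MB over 
the configured limit, all of it field data for the message field.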

Within a minute, things went from bad to worse: suddenly I'm getting shard 
errors (the first shard errors in ages - definitely related):

[2016-06-01 00:21:32,860][WARN ][indices.cluster          ] [fantail] 
[[graylog_488][0]] marking and sending shard failed due to [engine failure, 
reason [already closed by tragic event on the index writer]]
[graylog_488][[graylog_488][0]] ShardNotFoundException[no such shard]
at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
[2016-06-01 00:21:32,962][WARN ][cluster.action.shard     ] [fantail] 
[graylog_488][0] received shard failed for target shard [[graylog_488][0], 
node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED], 
a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message 
[engine failure, reason [already closed by tragic event on the index 
writer]], failure [OutOfMemoryError[Java heap space]]
[2016-06-01 00:21:32,974][WARN ][cluster.action.shard     ] [fantail] 
[graylog_488][0] received shard failed for target shard [[graylog_488][0], 
node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED], 
a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message 
[master {fantail}{Tjzmk9cFRuCke6JEuomb4g}{127.0.0.1}{127.0.0.1:9300} marked 
shard as started, but shard has previous failed. resending shard failure.]
[2016-06-01 00:21:33,182][INFO ][cluster.routing.allocation] [fantail] 
Cluster health status changed from [GREEN] to [RED] (reason: [shards failed 
[[graylog_488][0], [graylog_488][0]] ...]).



Restarting graylog-server and ES (and cleaning up...) will solve this - but 
this is lame. graylog is an end-user tool that *by design* will have people 
doing actions that - on occasion - are beyond the reach of the backend: 
there has to be some way this could be handled better. The ES people seem 
to think this is a case of "you're doing it wrong", but graylog isn't some 
programmed frontend where every ES call is tightly managed - it's something 
that is meant to be used to "play" with data. Basically all I did was take 
a previous search that worked and ask it to re-run with an hourly graph 
instead of a daily one - enough to tip it over the edge. This will happen 
time and time again - so is causing service outages an acceptable outcome?

How are others dealing with this? Could graylog capture the ES error and 
mitigate it (somehow)? I for one should have shut everything down before 
that "breaker" error turned into the "shard" error.

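As a rough illustration of what "capture and mitigate" could look like, 
here is a hedged Python sketch of a pre-flight check. It assumes the JSON 
returned by the ES node stats API (GET /_nodes/stats/breaker) has already 
been fetched and parsed into a dict, and that each node reports 
limit_size_in_bytes and estimated_size_in_bytes for its fielddata breaker. 
The function names, the 10% threshold, and the estimated size in the sample 
are invented for the example; this is not something graylog does today as 
far as this thread shows.

```python
def fielddata_headroom(node_stats):
    """Map node id -> fraction of the fielddata breaker limit still unused."""
    headroom = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        breaker = node.get("breakers", {}).get("fielddata")
        if not breaker:
            continue  # node did not report a fielddata breaker
        limit = breaker["limit_size_in_bytes"]
        used = breaker["estimated_size_in_bytes"]
        if limit <= 0:
            continue  # no usable limit reported, nothing to compare against
        headroom[node_id] = max(0.0, 1.0 - used / limit)
    return headroom


def nodes_at_risk(node_stats, min_headroom=0.10):
    """Return the ids of nodes whose unused fielddata budget is below min_headroom."""
    return sorted(
        node_id
        for node_id, free in fielddata_headroom(node_stats).items()
        if free < min_headroom
    )


# Sample payload: the limit is the one from the breaker warning above;
# the estimated size is a made-up value for illustration.
sample = {
    "nodes": {
        "Tjzmk9cFRuCke6JEuomb4g": {
            "breakers": {
                "fielddata": {
                    "limit_size_in_bytes": 10857952051,
                    "estimated_size_in_bytes": 10500000000,
                }
            }
        }
    }
}

print(nodes_at_risk(sample))
```

A frontend could run a check like this before launching an expensive 
histogram and refuse, or warn the user, when any node comes back at risk, 
instead of letting the search run into the breaker.
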
This is graylog-server-2.0.2/elasticsearch-2.3.3 under CentOS-7

Thanks

Jason
