Hi there,
I just ran a simple search over 30 days of data and managed to trigger the
following ES error:
[2016-06-01 00:12:53,525][WARN ][indices.breaker.fielddata] [fielddata] New
used memory 11273780309 [10.4gb] for data of [message] would be larger than
configured breaker: 10857952051 [10.1gb], breaking
According to what I can google, this means the fielddata circuit breaker
tripped: loading the [message] field would have needed 10.4gb against a
configured limit of 10.1gb, so ES refused the request. That condition
somehow triggers an epic fail: either ES becomes unresponsive or
graylog-server does - I can't tell the difference. All I know is that right
now I have messages going into graylog and nothing coming out.
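To put a number on how far over the limit that request was, here is a minimal Python sketch that parses the breaker warning quoted above. The function name `parse_breaker_warning` and the regular expression are my own illustration, not anything shipped with ES or graylog; the pattern is written against this one log line and may not match other ES versions.

```python
import re

# The breaker warning from the log above, joined onto one line.
LOG_LINE = (
    "[2016-06-01 00:12:53,525][WARN ][indices.breaker.fielddata] [fielddata] "
    "New used memory 11273780309 [10.4gb] for data of [message] would be "
    "larger than configured breaker: 10857952051 [10.1gb], breaking"
)

# Hypothetical pattern, derived only from the message format shown above.
BREAKER_RE = re.compile(
    r"New used memory (?P<used>\d+) \[[^\]]+\] "
    r"for data of \[(?P<field>[^\]]+)\] "
    r"would be larger than configured breaker: (?P<limit>\d+) \[[^\]]+\]"
)


def parse_breaker_warning(text):
    """Return the field, requested bytes, limit and overshoot, or None."""
    # Collapse line wraps so a message split across lines still matches.
    normalized = " ".join(text.split())
    match = BREAKER_RE.search(normalized)
    if match is None:
        return None
    used = int(match.group("used"))
    limit = int(match.group("limit"))
    return {
        "field": match.group("field"),
        "used_bytes": used,
        "limit_bytes": limit,
        "over_bytes": used - limit,
    }


if __name__ == "__main__":
    info = parse_breaker_warning(LOG_LINE)
    over_mib = info["over_bytes"] / 1024 ** 2
    print("field=%s over the limit by %.1f MiB" % (info["field"], over_mib))
```

The request overshot the limit by roughly 400 MiB, so the search was only slightly too big for the breaker rather than wildly out of range.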
Within a minute, things went from bad to worse: suddenly I'm getting shard
errors (the first shard errors in ages - definitely related):
[2016-06-01 00:21:32,860][WARN ][indices.cluster ] [fantail]
[[graylog_488][0]] marking and sending shard failed due to [engine failure,
reason [already closed by tragic event on the index writer]]
[graylog_488][[graylog_488][0]] ShardNotFoundException[no such shard]
at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
[2016-06-01 00:21:32,962][WARN ][cluster.action.shard ] [fantail]
[graylog_488][0] received shard failed for target shard [[graylog_488][0],
node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED],
a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message
[engine failure, reason [already closed by tragic event on the index
writer]], failure [OutOfMemoryError[Java heap space]]
[2016-06-01 00:21:32,974][WARN ][cluster.action.shard ] [fantail]
[graylog_488][0] received shard failed for target shard [[graylog_488][0],
node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED],
a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message
[master {fantail}{Tjzmk9cFRuCke6JEuomb4g}{127.0.0.1}{127.0.0.1:9300} marked
shard as started, but shard has previous failed. resending shard failure.]
[2016-06-01 00:21:33,182][INFO ][cluster.routing.allocation] [fantail]
Cluster health status changed from [GREEN] to [RED] (reason: [shards failed
[[graylog_488][0], [graylog_488][0]] ...]).
Restarting graylog-server and ES (and cleaning up...) will solve this - but
this is lame. graylog is an end-user tool that *by design* will have people
doing actions that - on occasion - are beyond the reach of the backend:
there has to be some way this could be handled better. The ES people seem
to think this is a case of "you're doing it wrong", but graylog isn't some
programmed frontend where every ES call is tightly managed - it's something
that is meant to be used to "play" with data. Basically all I did was take
a previous search that worked and ask it to re-run with an hourly graph
instead of a daily one - enough to tip it over the edge. This will happen
time and time again - so is causing service outages an acceptable outcome?
How are others dealing with this? Could graylog capture the ES error and
mitigate (somehow)? I for one should have shut everything down before that
"breaker" error turned into the "shard" error.
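As one idea for catching this earlier, below is a rough Python sketch of a check that flags nodes whose fielddata breaker is close to its limit. It assumes the JSON layout I believe `GET /_nodes/stats/breaker` returns on ES 2.x (`nodes` -> node id -> `breakers` -> `fielddata` with `limit_size_in_bytes` and `estimated_size_in_bytes`); treat that layout as an assumption to verify against your own cluster. The function name `breakers_near_limit`, the 80% threshold and the sample `estimated_size_in_bytes` value are made up for illustration. This is not something graylog does today as far as I know.

```python
def breakers_near_limit(stats, breaker="fielddata", threshold=0.8):
    """List (node_name, usage_ratio) for nodes at or above the threshold.

    `stats` is a dict in the (assumed) shape of the ES nodes stats
    response restricted to breakers.
    """
    hot = []
    for node_id, node in stats.get("nodes", {}).items():
        info = node.get("breakers", {}).get(breaker)
        if not info:
            continue  # this node does not report the requested breaker
        limit = info.get("limit_size_in_bytes", 0)
        if limit <= 0:
            continue  # avoid dividing by zero on a disabled breaker
        ratio = info.get("estimated_size_in_bytes", 0) / limit
        if ratio >= threshold:
            hot.append((node.get("name", node_id), ratio))
    return hot


# Sample input: node id, node name and limit are taken from the logs above;
# the estimated size is a hypothetical value chosen to sit near the limit.
SAMPLE_STATS = {
    "nodes": {
        "Tjzmk9cFRuCke6JEuomb4g": {
            "name": "fantail",
            "breakers": {
                "fielddata": {
                    "limit_size_in_bytes": 10857952051,
                    "estimated_size_in_bytes": 9800000000,
                }
            },
        }
    }
}

if __name__ == "__main__":
    for name, ratio in breakers_near_limit(SAMPLE_STATS):
        print("%s: fielddata breaker at %.0f%% of its limit" % (name, ratio * 100))
```

Run from cron against the real stats output, something like this would at least give a warning before the "breaker" error has a chance to turn into the "shard" error.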
This is graylog-server-2.0.2/elasticsearch-2.3.3 under CentOS-7.
Thanks
Jason