Hi Jason,

since Graylog 2.0.0 you can restrict the time range users are allowed to run 
queries over (see System -> Configurations -> Searches configuration). 
Other than that, it would help to split your indices into more shards (and 
distribute them across more Elasticsearch nodes).
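
If it helps to see where you currently stand, here is a rough sketch 
(assuming Elasticsearch is reachable on localhost:9200, adjust to your 
setup) for checking the current shard layout of the Graylog indices and 
the state of the circuit breakers on each node:

  # how the graylog_* indices are sharded and where the shards live
  curl -XGET 'http://localhost:9200/_cat/shards/graylog_*?v'
  # per-node breaker limits and how close each one is to tripping
  curl -XGET 'http://localhost:9200/_nodes/stats/breaker?pretty'

If I remember correctly, the shard count for newly created indices is 
controlled by the elasticsearch_shards setting in the Graylog server.conf; 
existing indices keep the layout they were created with.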

That being said, the error you mentioned (the fielddata circuit breaker 
tripping) most often occurs when the field in question ("message" in this 
case) is used in an aggregation (e.g. Quick Values), so you might want to 
avoid aggregations on analyzed fields like "message", "full_message", and 
"source".
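
To see how much heap fielddata those fields are actually holding, something 
along these lines should work (again assuming localhost:9200):

  # per-node fielddata usage broken down by field
  curl -XGET 'http://localhost:9200/_cat/fielddata?v&fields=message,full_message,source'
  # as a stop-gap, the fielddata cache can be dropped without a restart
  curl -XPOST 'http://localhost:9200/_cache/clear?fielddata=true'

The numbers are only a snapshot, but they usually make it obvious which 
field is responsible for tripping the breaker.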

Cheers,
Jochen

On Wednesday, 1 June 2016 02:48:24 UTC+2, Jason Haar wrote:
>
> Hi there
>
> I just did a simple search on 30 days of data and managed to trigger the 
> following ES error
>
> [2016-06-01 00:12:53,525][WARN ][indices.breaker.fielddata] [fielddata] 
> New used memory 11273780309 [10.4gb] for data of [message] would be larger 
> than configured breaker: 10857952051 [10.1gb], breaking
>
>
> According to what I can google, this means that ES would have had to 
> allocate more resources than were available to fulfil the request, and 
> that condition somehow triggers an epic fail: either ES becomes 
> unresponsive or graylog-server does - I can't tell the difference. All I 
> know is that right now I have messages going into Graylog and nothing 
> coming out.
>
> Within a minute, things went from bad to worse: suddenly I'm getting shard 
> errors (first shard errors in ages - definitely related)
>
> [2016-06-01 00:21:32,860][WARN ][indices.cluster          ] [fantail] 
> [[graylog_488][0]] marking and sending shard failed due to [engine failure, 
> reason [already closed by tragic event on the index writer]]
> [graylog_488][[graylog_488][0]] ShardNotFoundException[no such shard]
> at org.elasticsearch.index.IndexService.shardSafe(IndexService.java:197)
> [2016-06-01 00:21:32,962][WARN ][cluster.action.shard     ] [fantail] 
> [graylog_488][0] received shard failed for target shard [[graylog_488][0], 
> node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED], 
> a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message 
> [engine failure, reason [already closed by tragic event on the index 
> writer]], failure [OutOfMemoryError[Java heap space]]
> [2016-06-01 00:21:32,974][WARN ][cluster.action.shard     ] [fantail] 
> [graylog_488][0] received shard failed for target shard [[graylog_488][0], 
> node[Tjzmk9cFRuCke6JEuomb4g], [P], v[2], s[STARTED], 
> a[id=dgyATFPBQAywkydc2mxmPw]], indexUUID [jxF7U5fESqOzJu9CSDF3WA], message 
> [master {fantail}{Tjzmk9cFRuCke6JEuomb4g}{127.0.0.1}{127.0.0.1:9300} 
> marked shard as started, but shard has previous failed. resending shard 
> failure.]
> [2016-06-01 00:21:33,182][INFO ][cluster.routing.allocation] [fantail] 
> Cluster health status changed from [GREEN] to [RED] (reason: [shards failed 
> [[graylog_488][0], [graylog_488][0]] ...]).
>
>
>
> Restarting graylog-server and ES (and cleaning up...) will solve this - 
> but this is lame. Graylog is an end-user tool that *by design* will have 
> people doing actions that - on occasion - are beyond the reach of the 
> backend: there has to be some way this could be handled better. The ES 
> people seem to think this is a case of "you're doing it wrong", but Graylog 
> isn't some programmed frontend where every ES call is tightly managed - 
> it's something that is meant to be used to "play" with data. Basically all 
> I did was take a previous search that worked and ask it to re-run with an 
> hourly graph instead of a daily one - enough to tip it over the edge. This 
> will happen time and time again - so is causing service outages really an 
> acceptable outcome?
>
> How are others dealing with this? Could Graylog capture the ES error and 
> mitigate it (somehow)? I, for one, should have shut everything down before 
> that "breaker" error turned into the "shard" error.
>
> This is graylog-server-2.0.2/elasticsearch-2.3.3 under CentOS-7
>
> Thanks
>
> Jason
>
