Thanks Jörg. I think my experience with other data stores (Mongo, SQL) had me assuming ES works the same way, when in reality it's a different tool with different pros and cons. As for configuration, we're basically running stock (we enabled slow logs, unicast discovery, and site tagging). I think ultimately ES is heavily reliant on not swapping, so the memory you give it needs to hold all of your results, whereas in the past I've allowed SQL servers to swap to handle larger loads.
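For context, the "basically stock" setup I mean looks roughly like the sketch below (an ES 1.x elasticsearch.yml; host names, the rack tag, and the slow-log thresholds are made up for illustration, not our exact values):

```yaml
# elasticsearch.yml — sketch of a near-stock ES 1.x config (values illustrative)

cluster.name: logging-es

# Site tagging: a custom node attribute shard allocation can be made aware of
node.rack: us-east-1a

# Unicast discovery instead of multicast
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1", "master2", "master3"]

# Lock the heap in RAM — ES degrades badly once the JVM starts swapping
bootstrap.mlockall: true

# Slow logs, to spot the queries that hurt before they take a node down
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.fetch.warn: 1s
index.indexing.slowlog.threshold.index.warn: 10s
```

With bootstrap.mlockall enabled the JVM heap is pinned and can't spill to swap, which is why sizing the heap for your worst-case query matters so much more here than it did for our SQL boxes.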
We'll play with some other settings. I liken Kibana to handing someone with no SQL best-practices knowledge a 5 TB SQL database and a GUI query builder and being surprised when they join 50 tables and bring the SQL server to its knees :)

On Tuesday, May 6, 2014 5:32:51 AM UTC-7, Jörg Prante wrote:
>
> ES has a lot of failsafe mechanisms against "OutOfMemoryError" built in:
>
> - thread pools are strict, they do not grow endlessly
> - field cache usage is limited
> - a field circuit breaker helps to terminate queries early before too much memory is consumed
> - closing unused indices frees heap resources that are no longer required
> - balancing shards over nodes equalizes resource usage across the nodes
> - catching "Throwable" in several critical modules allows spontaneous recovery from temporary JVM OOMs (e.g. if GC is too slow)
>
> Nevertheless you can override defaults and get into the "red area" where an ES node is no longer able to react properly over the API, also because of
>
> - misconfigurations
> - "bad behaving" queries which exploit CPU usage or exceed available heap in unpredictable ways
> - unexpected, huge query loads, large result sets
> - sudden peaks of resource usage, e.g. while merging large segments, or bulk indexing
> - distorted document/term distribution over shards that knocks out equal shard balancing
> - etc.
>
> Unresponsive nodes are taken out of the cluster after a few seconds, so this is not really a problem, unless you have no replica, or the cluster can't keep up with recovery from such events.
>
> There is no known mechanism to protect you automatically from crossing the line into the "red area" where a JVM cannot recover from OOM and becomes unresponsive. This is not specific to ES but to all JVM applications.
>
> Best practice is "know your data, know your nodes".
> Exercise your ES cluster before putting real data on it to get an idea of the maximum capacity of a node or the whole cluster and the best configuration options, and put a proxy in front of ES to allow only "well behaving" actions.
>
> Jörg
>
> On Tue, May 6, 2014 at 2:06 AM, Nate Fox <[email protected]> wrote:
>
>> Is there any way to prevent ES from blowing up just by selecting too much data? This is my biggest concern.
>> Is it because bootstrap.mlockall is on, so we give ES/JVM a specified amount of memory and that's all that node will receive? If we turned that off and had gobs more swap available for ES, would it not blow up, but just be really slow?
>>
>> On Mon, May 5, 2014 at 4:12 PM, Mark Walkom <[email protected]> wrote:
>>
>>> Then you need more nodes, more heap on existing nodes, or less data. You've reached the limit of what your current cluster can handle; that is why this is happening.
>>>
>>> Regards,
>>> Mark Walkom
>>>
>>> Infrastructure Engineer
>>> Campaign Monitor
>>> email: [email protected]
>>> web: www.campaignmonitor.com
>>>
>>> On 6 May 2014 09:11, Nate Fox <[email protected]> wrote:
>>>
>>>> I have 11 nodes. 3 are dedicated masters and the other 8 are data nodes.
>>>>
>>>> On May 5, 2014 4:03 PM, "[email protected]" <[email protected]> wrote:
>>>>
>>>>> You have only two nodes, it seems. Adding nodes may help.
>>>>>
>>>>> Besides data nodes that do the heavy work, set up 3 master-eligible nodes (data-less nodes, with a reasonably smaller heap size for cluster state and mappings). Set the other data nodes to non-eligible for becoming master.
>>>>>
>>>>> Jörg
>>>>>
>>>>> On Mon, May 5, 2014 at 9:34 PM, Nate Fox <[email protected]> wrote:
>>>>>
>>>>>> We're using ES 1.1.0 for central logging storage/searching.
>>>>>> When we use Kibana and search a month's worth of data, our cluster becomes unresponsive. By unresponsive I mean that many nodes will respond immediately to a 'curl localhost:9200' but a couple will not. This leads to any cluster metrics not being available when querying the master, and we're unable to set any cluster-level settings.
>>>>>>
>>>>>> We're getting these types of errors in the logs:
>>>>>> [2014-05-05 19:10:50,763][WARN ][transport.netty ] [Leap-Frog] exception caught on transport layer [[id: 0x4b074069, /10.6.10.211:57563 => /10.6.10.148:9300]], closing connection
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>
>>>>>> The cluster seems to never recover either - and that is my biggest concern. So my questions are:
>>>>>> 1. Is it normal for the entire cluster to just close up shop because a couple of nodes are unresponsive? I thought the field data circuit breaker would fix this, but maybe this is a different problem.
>>>>>> 2. How to best get ES to recover from this scenario? I don't really want to restart just the two nodes, as we have >1 TB of data on each node, but issuing a disable_allocation fails because it cannot write to all nodes in the cluster.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2fb4e427-cf95-4882-bd87-728fbfef10dd%40googlegroups.com.
>>>>>> For more options, visit https://groups.google.com/d/optout.
