Thanks for the reply. Today the problem happened again: a bad node stopped responding and brought down the whole cluster, but this time memory looks fine. Here are the details:

1. Again, the management APIs such as _nodes and _cat are not returning, only the default 9200 response. If I hit the master node directly, the default 9200 endpoint returns 200, but the other APIs still don't work.

2. No out-of-memory exception. We set the heap to 20GB, but usage is only about 15GB. (Could that be the problem? The machine has 32GB of memory.)

3. I restarted a couple of high-memory nodes, and the master too, and the cluster still did not recover, until I found master node logs pointing at one node and saying the operation could not be executed on that bad node.

4. Again, the bad node's log is missing an entire time period starting a couple of hours ago, and in Marvel the node stopped reporting status around the same time. I didn't see anything suspicious in the Marvel events, though. Unlike the first incident, there's no obvious problem (no GC log entries), apart from some failing index operations. I also checked the field_data size this time; it's not big, only around 1GB.

What can I do to pinpoint what's going on? Next time I plan to capture the checks below from the bad node before restarting it.
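This is just a rough sketch of what I intend to run, assuming I still have curl access to port 9200 on each host; BAD_NODE and the process ID are placeholders for whichever node is misbehaving:

    # basic cluster view, asked of the bad node itself
    curl -s 'http://BAD_NODE:9200/_cluster/health?pretty'
    curl -s 'http://BAD_NODE:9200/_cluster/pending_tasks?pretty'

    # heap and fielddata usage per node, to confirm memory really is fine
    curl -s 'http://BAD_NODE:9200/_nodes/stats/jvm,indices?pretty'
    curl -s 'http://BAD_NODE:9200/_cat/fielddata?v'

    # what the node is actually busy doing while it is unresponsive
    curl -s 'http://BAD_NODE:9200/_nodes/hot_threads'

    # thread dump taken directly on the box, in case HTTP hangs as well
    jstack <elasticsearch PID>

Does that look like the right set of things to capture, or is there something better to grab before the restart?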
On Sun, Feb 8, 2015 at 11:11 AM, Adrien Grand <[email protected]> wrote:

> Indeed, JVMs sometimes need to "stop the world" in case of memory pressure.
> You might find some advice about GC tuning here or there, but I would
> advise against it, as it is very hard to evaluate the impact of these
> settings.
>
> If this issue happens on a regular basis, it might mean that your cluster
> is undersized and should be given more memory so that the JVM doesn't have
> to run full GCs so often. Otherwise, you should look at how you could
> modify elasticsearch's configuration in order to load less stuff into
> memory (such as using doc values for fielddata). Another option is to run
> two nodes per machine instead of one (each with half the memory). Given
> that full GCs are shorter on small heaps, this should limit the issue.
>
> On Sat, Feb 7, 2015 at 2:55 AM, liu wei <[email protected]> wrote:
>
>> Hi,
>>
>> We recently had a few incidents where a single index with low memory
>> impacted the entire cluster. None of the cluster-related APIs are
>> responding, and Kibana 3 and 4 are failing to load too. From the log it
>> seems the node is doing GC and not responding to any requests, and there
>> are no log entries between 2:29 and 4:07, when I restarted the node. Is
>> there any way to make this more resilient?
>>
>> [2015-02-05 14:29:17,199][INFO ][monitor.jvm ] [Big Wheel] [gc][young][78379][36567] duration [864ms], collections [1]/[1.7s], total [864ms]/[1.4h], memory [15.2gb]->[14.6gb]/[19.9gb], all_pools {[young] [599.8mb]->[2.8mb]/[665.6mb]}{[survivor] [75.6mb]->[83.1mb]/[83.1mb]}{[old] [14.5gb]->[14.5gb]/[19.1gb]}
>>
>> [2015-02-05 14:29:23,302][WARN ][monitor.jvm ] [Big Wheel] [gc][young][78384][36568] duration [1.4s], collections [1]/[2s], total [1.4s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young] [459.7mb]->[15.7mb]/[665.6mb]}{[survivor] [83.1mb]->[83.1mb]/[83.1mb]}{[old] [14.5gb]->[14.6gb]/[19.1gb]}
>>
>> [2015-02-05 14:29:34,990][INFO ][monitor.jvm ] [Big Wheel] [gc][young][78395][36571] duration [900ms], collections [1]/[1.4s], total [900ms]/[1.4h], memory [15.1gb]->[14.6gb]/[19.9gb], all_pools {[young] [484.9mb]->[3.9mb]/[665.6mb]}{[survivor] [71.7mb]->[52.4mb]/[83.1mb]}{[old] [14.6gb]->[14.6gb]/[19.1gb]}
>>
>> [2015-02-05 14:29:45,055][WARN ][monitor.jvm ] [Big Wheel] [gc][young][78404][36574] duration [1.2s], collections [1]/[2s], total [1.2s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young] [472.8mb]->[2.9mb]/[665.6mb]}{[survivor] [83.1mb]->[67.6mb]/[83.1mb]}{[old] [14.6gb]->[14.6gb]/[19.1gb]}
>>
>> [2015-02-05 16:07:15,509][INFO ][node ] [Pyro] version[1.4.2], pid[9796], build[927caff/2014-12-16T14:11:12Z]
>>
>> [2015-02-05 16:07:15,510][INFO ][node ] [Pyro] initializing ...
>>
>> [2015-02-05 16:07:15,638][INFO ][plugins ] [Pyro] loaded [marvel, cloud-azure], sites [marvel, kopf]
>>
>> [2015-02-05 16:07:24,844][INFO ][node ] [Pyro] initialized
>>
>> [2015-02-05 16:07:24,845][INFO ][node ] [Pyro] starting ...
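One more question on the doc values suggestion quoted above. If I understand it correctly, the change would look roughly like the mapping below for our new indices (the index, type, and field names are placeholders, and my understanding is this only applies to newly created fields, so existing data would need reindexing):

    # enable doc values on the biggest fielddata consumers;
    # works for not_analyzed strings and for numeric/date fields
    curl -XPUT 'http://localhost:9200/logs-2015.02.09/_mapping/event' -d '
    {
      "properties": {
        "status":     { "type": "string", "index": "not_analyzed", "doc_values": true },
        "@timestamp": { "type": "date", "doc_values": true }
      }
    }'

Is that the right way to move fielddata off the heap, or should I be setting it through an index template instead?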
