Thanks for the reply. Today the problem happened again: a bad node stopped responding and brought down the whole cluster, but this time memory looks fine. Here are the details:

1. Again, the management APIs such as _nodes and _cat are not returning, only the default 9200 response. If I hit the master node directly, the default 9200 endpoint returns 200, but the other APIs still don't work.

2. No out-of-memory exception. We set the heap to 20GB, but usage is only about 15GB. (Could that be the problem? The machine has 32GB of memory.)

3. I restarted a couple of high-memory nodes, and the master too, and the cluster still did not recover, until I found master node logs pointing at one node and saying the operation could not be executed on that bad node.

4. Again, the bad node's log is missing an entire time period starting a couple of hours ago, and in Marvel the node stopped reporting status around the same time. I didn't see anything suspicious in the Marvel events, though. Unlike the first incident, there's no obvious problem (no GC log entries), apart from some failing index operations. I also checked the field_data size this time; it's not big, only around 1GB.

What can I do to pinpoint what's going on? Next time I plan to capture the checks below from the bad node before restarting it.
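This is just a rough sketch of what I intend to run, assuming I still have curl access to port 9200 on each host; BAD_NODE and the process ID are placeholders for whichever node is misbehaving:

    # basic cluster view, asked of the bad node itself
    curl -s 'http://BAD_NODE:9200/_cluster/health?pretty'
    curl -s 'http://BAD_NODE:9200/_cluster/pending_tasks?pretty'

    # heap and fielddata usage per node, to confirm memory really is fine
    curl -s 'http://BAD_NODE:9200/_nodes/stats/jvm,indices?pretty'
    curl -s 'http://BAD_NODE:9200/_cat/fielddata?v'

    # what the node is actually busy doing while it is unresponsive
    curl -s 'http://BAD_NODE:9200/_nodes/hot_threads'

    # thread dump taken directly on the box, in case HTTP hangs as well
    jstack <elasticsearch PID>

Does that look like the right set of things to capture, or is there something better to grab before the restart?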
On Sun, Feb 8, 2015 at 11:11 AM, Adrien Grand <[email protected]> wrote:

> Indeed, JVMs sometimes need to "stop the world" in case of memory pressure.
> You might find some advice about GC tuning here or there, but I would
> advise against it, as it is very hard to evaluate the impact of these
> settings.
>
> If this issue happens on a regular basis, it might mean that your cluster
> is undersized and should be given more memory so that the JVM doesn't have
> to run full GCs so often. Otherwise, you should look at how you could
> modify elasticsearch's configuration in order to load less stuff into
> memory (such as using doc values for fielddata). Another option is to run
> two nodes per machine instead of one (each with half the memory). Given
> that full GCs are shorter on small heaps, this should limit the issue.
>
> On Sat, Feb 7, 2015 at 2:55 AM, liu wei <[email protected]> wrote:
>
>> Hi,
>>
>> We recently had a few incidents where a single index with low memory
>> impacted the entire cluster. None of the cluster-related APIs are
>> responding, and Kibana 3 and 4 are failing to load too. From the log it
>> seems the node is doing GC and not responding to any requests, and there
>> are no log entries between 2:29 and 4:07, when I restarted the node. Is
>> there any way to make this more resilient?
>>
>> [2015-02-05 14:29:17,199][INFO ][monitor.jvm ] [Big Wheel] [gc][young][78379][36567] duration [864ms], collections [1]/[1.7s], total [864ms]/[1.4h], memory [15.2gb]->[14.6gb]/[19.9gb], all_pools {[young] [599.8mb]->[2.8mb]/[665.6mb]}{[survivor] [75.6mb]->[83.1mb]/[83.1mb]}{[old] [14.5gb]->[14.5gb]/[19.1gb]}
>>
>> [2015-02-05 14:29:23,302][WARN ][monitor.jvm ] [Big Wheel] [gc][young][78384][36568] duration [1.4s], collections [1]/[2s], total [1.4s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young] [459.7mb]->[15.7mb]/[665.6mb]}{[survivor] [83.1mb]->[83.1mb]/[83.1mb]}{[old] [14.5gb]->[14.6gb]/[19.1gb]}
>>
>> [2015-02-05 14:29:34,990][INFO ][monitor.jvm ] [Big Wheel] [gc][young][78395][36571] duration [900ms], collections [1]/[1.4s], total [900ms]/[1.4h], memory [15.1gb]->[14.6gb]/[19.9gb], all_pools {[young] [484.9mb]->[3.9mb]/[665.6mb]}{[survivor] [71.7mb]->[52.4mb]/[83.1mb]}{[old] [14.6gb]->[14.6gb]/[19.1gb]}
>>
>> [2015-02-05 14:29:45,055][WARN ][monitor.jvm ] [Big Wheel] [gc][young][78404][36574] duration [1.2s], collections [1]/[2s], total [1.2s]/[1.4h], memory [15.1gb]->[14.7gb]/[19.9gb], all_pools {[young] [472.8mb]->[2.9mb]/[665.6mb]}{[survivor] [83.1mb]->[67.6mb]/[83.1mb]}{[old] [14.6gb]->[14.6gb]/[19.1gb]}
>>
>> [2015-02-05 16:07:15,509][INFO ][node ] [Pyro] version[1.4.2], pid[9796], build[927caff/2014-12-16T14:11:12Z]
>>
>> [2015-02-05 16:07:15,510][INFO ][node ] [Pyro] initializing ...
>>
>> [2015-02-05 16:07:15,638][INFO ][plugins ] [Pyro] loaded [marvel, cloud-azure], sites [marvel, kopf]
>>
>> [2015-02-05 16:07:24,844][INFO ][node ] [Pyro] initialized
>>
>> [2015-02-05 16:07:24,845][INFO ][node ] [Pyro] starting ...
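One more question on the doc values suggestion quoted above. If I understand it correctly, the change would look roughly like the mapping below for our new indices (the index, type, and field names are placeholders, and my understanding is this only applies to newly created fields, so existing data would need reindexing):

    # enable doc values on the biggest fielddata consumers;
    # works for not_analyzed strings and for numeric/date fields
    curl -XPUT 'http://localhost:9200/logs-2015.02.09/_mapping/event' -d '
    {
      "properties": {
        "status":     { "type": "string", "index": "not_analyzed", "doc_values": true },
        "@timestamp": { "type": "date", "doc_values": true }
      }
    }'

Is that the right way to move fielddata off the heap, or should I be setting it through an index template instead?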
