I have an ES 0.90.11 cluster with three nodes (d0, d1, d2), each with 4 
cores and 7 GB of memory, running Ubuntu and JDK 7u45. The ES instances are 
all master+data, configured with a 3.5 GB heap size. They are pretty much running 
a vanilla configuration. Logstash is currently storing on average 200 logs 
per second to the cluster, and we use Kibana as a frontend. Usually when 
the cluster is started, the nodes run at around 20% CPU. However, after some 
time one or more of the nodes will jump up to around 90-100% CPU, and there 
they stay for what appears to be forever (until I tire and restart them).

Using "top -H" I can see that there is one thread in each elasticsearch 
process that is using most of the CPU. Here are examples from two of the 
nodes:

Node d1:

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
41969 elastic   20   0 5814m 3.5g  11m R  82.8 52.0   1036:30 java
45601 elastic   20   0 5814m 3.5g  11m S  31.9 52.0  23:02.45 java
41965 elastic   20   0 5814m 3.5g  11m S  19.1 52.0  25:25.97 java
41966 elastic   20   0 5814m 3.5g  11m S  12.7 52.0  25:25.95 java
41967 elastic   20   0 5814m 3.5g  11m S  12.7 52.0  25:23.10 java
41968 elastic   20   0 5814m 3.5g  11m S  12.7 52.0  25:23.27 java
45810 elastic   20   0 5814m 3.5g  11m S   6.4 52.0  22:59.55 java

Node d2:

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
40604 elastic   20   0 5812m 3.6g  11m R  99.9 53.2 926:23.96 java
41487 elastic   20   0 5812m 3.6g  11m S   6.5 53.2   4:35.11 java
42443 elastic   20   0 5812m 3.6g  11m S   6.5 53.2  47:03.65 java
42446 elastic   20   0 5812m 3.6g  11m S   6.5 53.2  47:05.12 java
42447 elastic   20   0 5812m 3.6g  11m S   6.5 53.2  46:38.30 java
31827 elastic   20   0 5812m 3.6g  11m S   6.5 53.2   0:00.59 java

As you can see there is one thread in each process that seems to be 
running amok. 

I have tried using the _nodes/hot_threads API to see which thread is using 
the CPU, but I can't identify any single thread with the same CPU 
percentage that top reports. In addition, I have tried using jstack to dump 
the threads, but the stack dump doesn't even list a thread with the 
thread PID from top.
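One detail worth knowing here: jstack reports each thread's native ID as a hex "nid", so the decimal thread PID from "top -H" will never appear literally in the dump. A minimal sketch of how to correlate the two, using the hot thread PID 40604 from node d2 above (the java-pid placeholder and the exact hot_threads parameters are assumptions, not from this thread):

```shell
# jstack prints native thread IDs in hex ("nid=0x..."), while `top -H`
# shows them in decimal, so convert before searching the dump.
TID=40604                     # hot thread PID on node d2 (from top -H above)
NID=$(printf '0x%x' "$TID")
echo "nid to look for: $NID"  # prints: nid to look for: 0x9e9c

# On the node itself (with the elasticsearch java process PID), something like:
# jstack <java-pid> | grep -B 2 -A 20 "nid=$NID"
#
# hot_threads can also be asked for more samples/detail, e.g.:
# curl -s 'localhost:9200/_nodes/hot_threads?threads=10&interval=1s&type=cpu'
```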

Here are a couple of charts showing the CPU user percentage:

<https://lh5.googleusercontent.com/-Clcdm5Zh5Ps/Uw4YZLI6BmI/AAAAAAAAEVE/eYINhJP3ACo/s1600/Image.png>


As you can see, all the nodes went from 20% to 100% at around 3 PM. At 
midnight I got tired of waiting and restarted ES, one node at a time.

The next chart is from some hours later:

<https://lh6.googleusercontent.com/-j5Fb3d-GxHU/Uw4YdvfNbaI/AAAAAAAAEVM/3c1g-ztRA18/s1600/Image.png>


In this case the nodes' CPU usage increased at different points in time.

CPU iowait remains low (5-10%) the whole time.

I'm thinking that maybe this behavior is triggered by large queries, but I 
don't have a specific test case that triggers it.

So, what can I do to find out what is going on? Any help would be greatly 
appreciated!

Regards,
Magnus Hyllander
