I have an ES 0.90.11 cluster with three nodes (d0, d1, d2), each with 4 cores and 7GB of memory, running Ubuntu and JDK 7u45. The ES instances are all master+data nodes, configured with a 3.5GB heap, and otherwise run a pretty much vanilla configuration. Logstash currently indexes on average 200 log events per second into the cluster, and we use Kibana as a frontend.

When the cluster is freshly started, the nodes usually run at around 20% CPU. After some time, however, one or more of the nodes jumps up to around 90-100% CPU and stays there for what appears to be forever (until I tire of waiting and restart them).
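For completeness, the heap is about the only thing we've changed from the defaults. A minimal sketch of that setting, assuming the stock Ubuntu/Debian packaging (the file path comes from that packaging):

    # /etc/default/elasticsearch
    # 3.5GB heap, i.e. half of the 7GB RAM; node.master and node.data
    # are left at their defaults (true) in elasticsearch.yml
    ES_HEAP_SIZE=3500m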
Using "top -H" I can see that there is one thread in each Elasticsearch process that is using most of the CPU. Here are examples from two of the nodes:

Node d1:

      PID  USER     PR  NI  VIRT   RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
    41969  elastic  20   0  5814m  3.5g  11m  R  82.8  52.0    1036:30  java
    45601  elastic  20   0  5814m  3.5g  11m  S  31.9  52.0   23:02.45  java
    41965  elastic  20   0  5814m  3.5g  11m  S  19.1  52.0   25:25.97  java
    41966  elastic  20   0  5814m  3.5g  11m  S  12.7  52.0   25:25.95  java
    41967  elastic  20   0  5814m  3.5g  11m  S  12.7  52.0   25:23.10  java
    41968  elastic  20   0  5814m  3.5g  11m  S  12.7  52.0   25:23.27  java
    45810  elastic  20   0  5814m  3.5g  11m  S   6.4  52.0   22:59.55  java

Node d2:

      PID  USER     PR  NI  VIRT   RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
    40604  elastic  20   0  5812m  3.6g  11m  R  99.9  53.2  926:23.96  java
    41487  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2    4:35.11  java
    42443  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2   47:03.65  java
    42446  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2   47:05.12  java
    42447  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2   46:38.30  java
    31827  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2    0:00.59  java

As you can see, there is one thread in each process that seems to be running amok. I have tried the _nodes/hot_threads API to see which thread is using the CPU, but I can't identify any single thread with the CPU percentage that top reports. I have also tried dumping the threads with jstack, but the stack dump doesn't even list a thread with the thread PID from top. (The exact commands I've been using are below, after my signature.)

Here are a couple of charts showing the CPU user percentage:

<https://lh5.googleusercontent.com/-Clcdm5Zh5Ps/Uw4YZLI6BmI/AAAAAAAAEVE/eYINhJP3ACo/s1600/Image.png>

As you can see, all the nodes went from 20% to 100% at around 3 PM. At midnight I got tired of waiting and restarted ES, one node at a time. The next chart is from some hours later:

<https://lh6.googleusercontent.com/-j5Fb3d-GxHU/Uw4YdvfNbaI/AAAAAAAAEVM/3c1g-ztRA18/s1600/Image.png>

In this case the nodes' CPU usage increased at different points in time. CPU iowait remains low (5-10%) the whole time.

I suspect this behavior is triggered by large queries, but I don't have a specific test case that reproduces it. So, what can I do to find out what is going on? Any help would be greatly appreciated!

Regards,
Magnus Hyllander
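PS: In case the details matter, this is roughly how I've been sampling the hot threads (the threads and interval values are just what I happened to try):

    # ask each node for its 10 busiest threads, sampled over a 1-second interval
    curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10&interval=1s'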
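And this is how I tried to match top's busy thread against the jstack dump. If I understand correctly, jstack prints each thread's native ID as a hexadecimal nid= value, so the decimal thread PID from top has to be converted first. Here 41969 is the busy thread on d1 above, and <es-pid> stands for the main Elasticsearch process PID (not a thread PID):

    # convert top's decimal thread PID to the hex nid= form jstack uses
    printf 'nid=0x%x\n' 41969              # prints: nid=0xa3f1
    # dump all threads of the ES process and look for that nid
    jstack <es-pid> | grep -A 20 'nid=0xa3f1'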
