Hey All, I've got a cluster with 5 data nodes (2 master nodes). The cluster has ~100 indices, w/ doc counts in the 1k - 50k range. There is a low/medium amount of index load going into the cluster via the bulk api and a large amount of search traffic going in in the 40K queries per second range.
I'm running these data nodes on ec2 (c3.8xl's) with a 30GB heap, though at the time of the following sample I was testing out running with a 20GB heap. The process runs well for a while, a couple hours to a day or two depending on traffic, and then it get's into a bad state where there is continual doing long gc runs, ie every minute doing a stop the world run for 30-45sec, and seemingly getting very little out of it (ie starting with 18.8GB heap usage and going to 18.3GB heap usage). Here the red line is a data node that is exhibiting the behavior. This is a graph of the "old" generation growing to nearly the complete heap size and then staying there for hours. During this time the application is severely degraded. <https://lh4.googleusercontent.com/-JXEVIJVBDDY/U8kyUY7hhyI/AAAAAAAACBo/dceW7JJGKiA/s1600/Screen+Shot+2014-07-18+at+10.37.44+AM.png> Example of one of the gc runs during this time (again they run every minute or so). [2014-07-18 00:24:24,735][WARN ][monitor.jvm ] [prod-targeting-es2] [gc][old][10799][27] duration [41.5s], collections [1]/[42.5s], total [41.5s]/[2.2m], memory [18.8gb]->[18.3gb]/[19.8gb], all_pools {[young] [733.2mb]->[249.9mb]/[1.4gb]}{[survivor] [86mb]->[0b]/[191.3mb]}{[old] [18gb]->[18.1gb]/[18.1gb]} We are running es 1.2.2 . We had been running Oracle 7u25 and we've tried upgrading to 7u65 with no effect. I just did a heap dump analysis using jmap and Eclipse Memory Analyzer and found that 85% of the heap was taken up with filter cache <https://lh4.googleusercontent.com/-KZ8SJD-o32M/U8kzdtC0KhI/AAAAAAAACBw/TeWTvmOc1rc/s1600/Screen+Shot+2014-07-18+at+1.34.44+AM.png> We are doing a lot of "bool" conditions in our queries, so that may be a factor in the hefty filter cache. Any ideas out there? Right now I have to bounce my data nodes every hour or two to ensure I don't reach this degraded state. -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/288da6e7-b85a-4cbf-a83d-d777ee7c9c57%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
