[ https://issues.apache.org/jira/browse/SOLR-15056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263829#comment-17263829 ]
Walter Underwood commented on SOLR-15056:
-----------------------------------------

CPU utilization and load average (run queue length) are complementary metrics. CPU utilization measures how much work the CPUs are doing. The load average measures how much work is waiting for them (how much they are not doing).

Under 100% CPU, load average doesn't tell us much, but CPU usage is very useful. At 100% CPU, CPU utilization can't tell us anything more, but load average tells us a lot: it tells us how much work is waiting to run.

Load average spikes only once the CPU is already very busy, at something like 90% utilization. By the time load average rises, Solr will already be overloaded and service will already be slowing down.

If Solr is running on a system that includes iowait in the load average, then it could be useful to have circuit breakers on both the CPU usage and the load average. The latter would tell us when the storage or network is overloaded.

I'm assuming that the hosts are configured with enough RAM so that normal searching doesn't hit disk. That makes Solr CPU-limited. All of our 100+ Solr systems are configured that way.

> CPU circuit breaker needs to use CPU utilization, not Unix load average
> -----------------------------------------------------------------------
>
>                 Key: SOLR-15056
>                 URL: https://issues.apache.org/jira/browse/SOLR-15056
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: metrics
>    Affects Versions: 8.7
>            Reporter: Walter Underwood
>            Priority: Major
>              Labels: Metrics
>         Attachments: SOLR-15056.patch
>
>
> The config range, 50% to 95%, assumes that the circuit breaker is triggered
> by a CPU utilization metric that goes from 0% to 100%. But the code uses the
> metric OperatingSystemMXBean.getSystemLoadAverage(). That is an average of
> the count of processes waiting to run. It is effectively unbounded. I've seen
> it as high as 50 to 100. It is not bounded by 1.0 (100%).
> A good limit for load average would need to be aware of the number of CPUs
> available to the JVM. A load average of 8 is no problem for a 32 CPU host. It
> is a critical situation for a 2 CPU host.
> Also, load average is a Unix OS metric. I don't know if it is even available
> on Windows.
> Instead, use a CPU utilization metric that goes from 0.0 to 1.0. A good
> choice is OperatingSystemMXBean.getSystemCpuLoad(). This name also uses
> "load", but it is a usage metric.
> From the Javadoc:
> > Returns the "recent cpu usage" for the whole system. This value is a double
> > in the [0.0,1.0] interval. A value of 0.0 means that all CPUs were idle
> > during the recent period of time observed, while a value of 1.0 means that
> > all CPUs were actively running 100% of the time during the recent period
> > being observed. All values betweens 0.0 and 1.0 are possible depending of
> > the activities going on in the system. If the system recent cpu usage is not
> > available, the method returns a negative value.
> https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getSystemCpuLoad()
> Also update the documentation to explain which JMX metrics are used for the
> memory and CPU circuit breakers.
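
For anyone who wants to compare the two metrics side by side, here is a minimal sketch, not the attached patch and not Solr's circuit breaker code. The class name CpuMetricsSketch, the helpers cpuUsageTripped and loadPerCpu, and the 0.95 threshold are made up for illustration. It assumes a JVM where the platform bean implements com.sun.management.OperatingSystemMXBean, which is where getSystemCpuLoad() lives (newer JDKs deprecate it in favor of getCpuLoad()). Both metrics return a negative value when they are unavailable, so the sketch treats a negative reading as "no data" rather than as a trip.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Illustrative only: names and thresholds here are assumptions, not Solr code.
final class CpuMetricsSketch {

    // True only when a usable CPU reading (0.0 to 1.0) exceeds the threshold.
    // A negative reading means "not available" and never trips.
    public static boolean cpuUsageTripped(double cpuLoad, double threshold) {
        return cpuLoad >= 0.0 && cpuLoad > threshold;
    }

    // Load average scaled by CPU count, so 8 on 32 CPUs reads as 0.25
    // and 8 on 2 CPUs reads as 4.0. Returns -1.0 when there is no data.
    public static double loadPerCpu(double loadAverage, int cpus) {
        if (loadAverage < 0.0 || cpus <= 0) {
            return -1.0;
        }
        return loadAverage / cpus;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cpus = Runtime.getRuntime().availableProcessors();

        // Unbounded run queue metric; negative if the platform does not provide it.
        double loadAvg = os.getSystemLoadAverage();

        // Bounded usage metric, only on the com.sun.management extension interface.
        double cpuLoad = -1.0;
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            cpuLoad = ((com.sun.management.OperatingSystemMXBean) os).getSystemCpuLoad();
        }

        System.out.println("cpus            = " + cpus);
        System.out.println("system cpu load = " + cpuLoad);
        System.out.println("load average    = " + loadAvg);
        System.out.println("load per cpu    = " + loadPerCpu(loadAvg, cpus));
        System.out.println("tripped at 0.95 = " + cpuUsageTripped(cpuLoad, 0.95));
    }
}
```

The first call to getSystemCpuLoad() can report 0.0 or a negative value on some JVMs because there is no "recent period" yet, so a real breaker would want to sample it repeatedly instead of reading it once at startup.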