Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................

KUDU-2144. Add metrics for Reactor load

This adds two new metrics:

1) reactor_load_percent

This measures the percentage of time that a reactor spends doing active
work (i.e. not blocked in epoll_wait()). As this approaches 100%, it
indicates that the server may be reactor-bound (e.g. due to skew, a
performance bug, or an insufficient number of reactor threads).

At first glance, it might seem like this should just be a simple counter
of cycles spent in epoll_wait(), so that metrics systems would calculate
a derived cycles/second rate. However, a simple rate metric would not
represent well the fact that there are multiple reactor threads, and it's
possible (and often the case) that only one of them is highly loaded due
to skew. Additionally, there are some bursty workloads where the load
averaged at minute granularity is relatively low, but when viewed on a
short time scale the reactor thread approaches 100% load.

So, to capture that, this new metric is a histogram of percentages.
Values are contributed to the histogram in the periodic TimerHandler,
which runs every 100ms. So, if there is any 100ms period in which the
reactor is fully loaded, it will contribute a "100%" sample to the
histogram. We can then inspect the percentiles and the raw counts to see
whether there are even short bursts where the reactors are the
bottleneck.

To test this out, I ran rpc-bench and modified the number of server
reactors. As I added more server reactors, I could see that the load
percentage (particularly the 95th percentile) went down as each reactor
was able to spend more time idle.

2) reactor_active_latency_us

This metric takes another approach to measuring potential latency issues
caused by reactors: it records a histogram of the time spent invoking
watcher callbacks. If, for example, we see that callback execution
frequently takes 100ms, we can assume that a similar amount of latency
would be contributed to any inbound or outbound RPCs associated with the
same reactor.

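The reasoning behind reactor_load_percent in (1) can be illustrated with a small standalone sketch. This is not the code from the patch: PercentHistogram, ReactorLoadSampler and SimulateBurstyMinute are hypothetical names invented here, and the real change uses Kudu's own histogram metric type and timer handler. The sketch only assumes what the description states: one percentage sample is recorded per 100ms timer tick.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a histogram metric: one bucket per integer
// percentage in [0, 100]. Kudu's real histogram class is not reproduced here.
class PercentHistogram {
 public:
  void Increment(int percent) {
    assert(percent >= 0 && percent <= 100);
    ++counts_[percent];
    ++total_;
  }

  uint64_t CountAt(int percent) const { return counts_[percent]; }

  // Average of all recorded samples.
  double Mean() const {
    assert(total_ > 0);
    uint64_t sum = 0;
    for (int v = 0; v <= 100; ++v) sum += counts_[v] * v;
    return static_cast<double>(sum) / static_cast<double>(total_);
  }

  // Smallest value v such that at least `pct` percent of samples are <= v.
  int ValueAtPercentile(int pct) const {
    assert(total_ > 0 && pct > 0 && pct <= 100);
    const uint64_t needed = (static_cast<uint64_t>(pct) * total_ + 99) / 100;
    uint64_t seen = 0;
    for (int v = 0; v <= 100; ++v) {
      seen += counts_[v];
      if (seen >= needed) return v;
    }
    return 100;
  }

 private:
  std::array<uint64_t, 101> counts_{};
  uint64_t total_ = 0;
};

// Hypothetical per-reactor accumulator: active time is added as work is done,
// and the periodic timer converts it into one percentage sample per window.
class ReactorLoadSampler {
 public:
  void AddActiveMicros(int64_t us) { active_us_ += us; }

  void OnTimer(int64_t window_us, PercentHistogram* hist) {
    assert(window_us > 0);
    const int64_t pct = std::min<int64_t>(100, active_us_ * 100 / window_us);
    hist->Increment(static_cast<int>(pct));
    active_us_ = 0;  // start a fresh window
  }

 private:
  int64_t active_us_ = 0;
};

// Simulates one minute (600 windows of 100ms) of a bursty reactor: every
// 20th window is fully busy, the rest are 5% busy.
PercentHistogram SimulateBurstyMinute() {
  constexpr int64_t kWindowUs = 100 * 1000;
  PercentHistogram hist;
  ReactorLoadSampler sampler;
  for (int i = 0; i < 600; ++i) {
    sampler.AddActiveMicros(i % 20 == 0 ? kWindowUs : kWindowUs / 20);
    sampler.OnTimer(kWindowUs, &hist);
  }
  return hist;
}
```

In this simulated minute the mean load stays below 10%, yet the histogram holds 30 samples at 100% and its 99th percentile is 100. That is the kind of short burst a per-minute rate would hide.
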
To test this out, I simulated KUDU-1944 (OpenSSL lock contention slowing
down socket IO) by adding a usleep(10ms) call in Socket::Recv and running
rpc-bench. I could see that the new metric shot up accordingly.

Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Reviewed-on: http://gerrit.cloudera.org:8080/8064
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <[email protected]>
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-bench.cc
M src/kudu/util/metrics.h
4 files changed, 133 insertions(+), 2 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8064

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
