Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-2144. Add metrics for Reactor load
......................................................................


KUDU-2144. Add metrics for Reactor load

This adds two new metrics:

1) reactor_load_percent

This measures the percentage of time that a reactor spends doing active
work (i.e not blocked in epoll_wait()). As this approaches 100%, it
indicates that the server may be reactor-bound (eg due to skew, a
performance bug, or insufficient number of reactor threads)

At first glance, it might seem like this should just be a simple counter
of cycles spent in epoll_wait(), so that metrics systems would calculate
a derived cycles/second rate. However, a simple rate metric would not
well represent the fact that there are multiple reactor threads, and
it's possible (and often the case) that only one of them is highly
loaded due to skew. Additionally, there are some bursty workloads where
the load average over minute granularity is relatively low, but when
viewed on a short time scale the reactor thread approaches 100% load.

So, to capture that, this new metric is a histogram of percentages.
Values are contributed to the histogram in the periodic TimerHandler
which runs every 100ms. So, if there is any 100ms period in which the
reactor is fully loaded, it will contribute a "100%" sample to the
histogram. We can then inspect the percentiles and the raw counts to see
if there are even any short bursts where the reactors are the
bottleneck.

To test this out, I ran rpc-bench and modified the number of server
reactors. As I added more server reactors, I could see that the load
percentage (particularly the 95th percentile) went down as each reactor
was able to spend more time idle.

2) reactor_active_latency_us

This metric takes another approach to measure potential latency issues
caused by reactors by measuring the histogram of time spent invoking
watcher callbacks. If, for example, we see that callback execution
frequently takes 100ms, we can assume that a similar amount of latency
would be contributed to any inbound or outbound RPCs associated with the
same reactor.

To test this out, I simulated KUDU-1944 (OpenSSL lock contention slowing
down socket IO) by adding a usleep(10ms) call in Socket::Recv and
running rpc-bench. I could see that the new metric shot up accordingly.

Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Reviewed-on: http://gerrit.cloudera.org:8080/8064
Tested-by: Kudu Jenkins
Reviewed-by: David Ribeiro Alves <[email protected]>
---
M src/kudu/rpc/reactor.cc
M src/kudu/rpc/reactor.h
M src/kudu/rpc/rpc-bench.cc
M src/kudu/util/metrics.h
4 files changed, 133 insertions(+), 2 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/8064
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic530af2836b1c31b3d754a9e0068fc5d31aa6fbb
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>

Reply via email to