Todd Lipcon created KUDU-2144:
---------------------------------
Summary: Add metric for reactor load
Key: KUDU-2144
URL: https://issues.apache.org/jira/browse/KUDU-2144
Project: Kudu
Issue Type: Improvement
Components: metrics, ops-tooling
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Recently I was debugging a cluster that appeared to have network issues. Only
after lots of investigation did I realize that the reactor threads were not
keeping up with network traffic due to hitting KUDU-1964 (this cluster was
running 1.3.0). At first glance the reactors did not seem busy, since each was
only using ~25% of a CPU -- however, the other 75% of the time was spent
blocked on OpenSSL locks and not in epoll_wait as one would normally expect.
This would be easier to diagnose if we had a metric showing the amount of time
the reactors spend idle (i.e. in epoll_wait) vs doing work (i.e. executing
callbacks, IO, etc). If any reactor is spending a high percentage of its time
outside epoll_wait, that suggests the reactors may be a bottleneck that is
increasing latency or degrading throughput.
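
To make the proposal concrete, here is a minimal sketch of the accounting such a
metric might do. It is not Kudu code: ReactorLoadTracker, before_poll,
after_poll and load_percent are hypothetical names, and the clock is injectable
only so the example can be exercised deterministically. The important choice is
that it measures wall-clock time rather than CPU time, so time spent blocked on
a lock counts as "not idle", which CPU usage alone would miss.

```python
import time


class ReactorLoadTracker:
    """Hypothetical helper: splits wall-clock time into idle (polling) vs busy."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so tests can use a fake clock
        self._idle = 0.0             # seconds spent waiting in the poll call
        self._busy = 0.0             # seconds spent everywhere else
        self._last = clock()         # timestamp of the most recent transition

    def before_poll(self):
        # Call just before blocking in the poll (e.g. epoll_wait).
        # Everything since the last mark was work: callbacks, IO, lock waits.
        now = self._clock()
        self._busy += now - self._last
        self._last = now

    def after_poll(self):
        # Call right after the poll returns.
        # Everything since the last mark was idle time inside the poll.
        now = self._clock()
        self._idle += now - self._last
        self._last = now

    def load_percent(self):
        # Share of observed wall-clock time spent outside the poll call.
        total = self._idle + self._busy
        return 100.0 * self._busy / total if total > 0 else 0.0


class FakeClock:
    """Deterministic clock for illustration; not needed in a real event loop."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def simulate_iteration(tracker, clock, poll_seconds, work_seconds):
    # One pass of an event loop: wait in the poll, then handle events.
    tracker.before_poll()
    clock.advance(poll_seconds)
    tracker.after_poll()
    clock.advance(work_seconds)
```

In a real event loop, before_poll and after_poll would bracket the blocking
poll call, and load_percent would be exported per reactor. Under this scheme a
reactor like the one described above, using only ~25% of a CPU but rarely
sitting in epoll_wait, would report a load close to 100% rather than 25%.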