[ https://issues.apache.org/jira/browse/KUDU-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165503#comment-16165503 ]
Todd Lipcon commented on KUDU-2144:
-----------------------------------
One slight wrinkle for this metric: if there are multiple reactors, the load
may be skewed such that only one of them is "overloaded". We should still
surface that case somehow, rather than exposing only an average across the
reactors, which would hide it.
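
To illustrate why an average hides the skew, here is a minimal sketch in
Python (not Kudu code). It assumes each reactor already reports a "busy
fraction" between 0.0 and 1.0; the function name, the dictionary keys, and the
sample values are all hypothetical.

```python
from statistics import mean


def summarize_reactor_load(busy_fractions):
    """Summarize per-reactor busy fractions (0.0 = always idle, 1.0 = never idle).

    Hypothetical helper: reports the maximum alongside the mean so that a
    single overloaded reactor is not averaged away.
    """
    if not busy_fractions:
        raise ValueError("need at least one reactor")
    busiest = max(range(len(busy_fractions)), key=busy_fractions.__getitem__)
    return {
        "mean": mean(busy_fractions),
        "max": busy_fractions[busiest],
        "busiest_reactor": busiest,
    }


# Four made-up reactors; only reactor 2 is overloaded.
loads = [0.10, 0.12, 0.95, 0.11]
summary = summarize_reactor_load(loads)
print(summary)
```

With these made-up numbers the mean looks unremarkable while the maximum
points at the one saturated reactor, which is the case the metric should make
visible.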
> Add metric for reactor load
> ---------------------------
>
> Key: KUDU-2144
> URL: https://issues.apache.org/jira/browse/KUDU-2144
> Project: Kudu
> Issue Type: Improvement
> Components: metrics, ops-tooling
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
>
> Recently I was debugging a cluster that appeared to have network issues. Only
> after lots of investigation did I realize that the reactor threads were not
> keeping up with network traffic due to hitting KUDU-1964 (this cluster was
> running 1.3.0). At first glance the reactors did not seem busy, since each
> was only using ~25% of a CPU -- however, the other 75% of the time was spent
> blocked on OpenSSL locks and not in epoll_wait as one would normally expect.
> This would be easier to diagnose if we had a metric showing the amount of
> time the reactors spend idle (i.e. in epoll_wait) vs doing work (i.e. executing
> callbacks, IO, etc). If any reactor is spending a high percentage of its time
> outside epoll_wait, that suggests the reactors may be a bottleneck that is
> increasing latency or degrading throughput.
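
One way such a metric could be computed is sketched below in Python. This is
an illustration under stated assumptions, not the Kudu implementation: the
class name, its methods, and the injectable clock are hypothetical. The idea
is to time only the blocking wait and count everything else as "busy", so time
spent blocked on a lock is counted as load even though it uses no CPU.

```python
import time


class ReactorLoadTracker:
    """Tracks time spent idle (inside the blocking wait) vs total elapsed time.

    Hypothetical sketch: `clock` is any zero-argument callable returning
    seconds; it defaults to time.monotonic and can be replaced in tests.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self._idle = 0.0

    def wait(self, blocking_wait):
        """Run the blocking wait (e.g. an epoll-style poll) and count its duration as idle."""
        before = self._clock()
        result = blocking_wait()
        self._idle += self._clock() - before
        return result

    def busy_fraction(self):
        """Fraction of elapsed time NOT spent inside the blocking wait."""
        elapsed = self._clock() - self._start
        if elapsed <= 0:
            return 0.0
        return 1.0 - self._idle / elapsed
```

In an event loop, each iteration would route its poll call through `wait()`
and run callbacks as usual. A thread that uses ~25% of a CPU but spends the
remaining time blocked on locks would then report a busy fraction near 1.0
rather than 0.25.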