[ 
https://issues.apache.org/jira/browse/KUDU-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165524#comment-16165524
 ] 

Todd Lipcon commented on KUDU-2144:
-----------------------------------

Another way to get at the same information might be to measure the latency 
between submitting a task to the ReactorTask queue and that task actually 
being executed. If we exposed this as a histogram, we could probably see 
when a reactor is responding slowly, whatever the cause, which would lead us 
more quickly to start pstacking or profiling the reactor.
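
To make the idea concrete, here is a rough sketch of stamping each task at 
submission time and recording the queue-wait latency just before the task 
runs. This is not Kudu code: LatencyHistogram and TimedTaskQueue are 
hypothetical stand-ins for a real histogram metric and the reactor's task 
queue, and the power-of-two microsecond bucketing is only an assumption made 
for illustration.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a real histogram metric. Bucket 0 counts
// latencies of 0 us; bucket i (i >= 1) counts [2^(i-1), 2^i) us. The last
// bucket also absorbs anything larger.
class LatencyHistogram {
 public:
  static constexpr std::size_t kNumBuckets = 32;

  static std::size_t BucketFor(std::uint64_t micros) {
    std::size_t bucket = 0;
    while (micros > 0 && bucket < kNumBuckets - 1) {
      micros >>= 1;
      ++bucket;
    }
    return bucket;
  }

  void Record(std::uint64_t micros) {
    std::lock_guard<std::mutex> l(lock_);
    ++counts_[BucketFor(micros)];
    ++total_;
  }

  std::uint64_t CountAt(std::size_t bucket) const {
    assert(bucket < kNumBuckets);
    std::lock_guard<std::mutex> l(lock_);
    return counts_[bucket];
  }

  std::uint64_t Total() const {
    std::lock_guard<std::mutex> l(lock_);
    return total_;
  }

 private:
  mutable std::mutex lock_;
  std::array<std::uint64_t, kNumBuckets> counts_{};
  std::uint64_t total_ = 0;
};

// Hypothetical task queue drained by a single event-loop thread.
class TimedTaskQueue {
 public:
  using Clock = std::chrono::steady_clock;

  explicit TimedTaskQueue(LatencyHistogram* hist) : hist_(hist) {}

  // Called from any thread: stamp the task with its submission time.
  void Submit(std::function<void()> fn) {
    std::lock_guard<std::mutex> l(lock_);
    tasks_.push_back(Entry{Clock::now(), std::move(fn)});
  }

  // Called from the event-loop thread: record how long each task waited in
  // the queue, then run it. Returns the number of tasks executed.
  std::size_t RunPending() {
    std::deque<Entry> batch;
    {
      std::lock_guard<std::mutex> l(lock_);
      batch.swap(tasks_);
    }
    for (Entry& e : batch) {
      auto waited = std::chrono::duration_cast<std::chrono::microseconds>(
          Clock::now() - e.submitted);
      hist_->Record(static_cast<std::uint64_t>(waited.count()));
      e.fn();
    }
    return batch.size();
  }

 private:
  struct Entry {
    Clock::time_point submitted;
    std::function<void()> fn;
  };

  LatencyHistogram* hist_;
  std::mutex lock_;
  std::deque<Entry> tasks_;
};
```

The latency is recorded before the task body runs, so a slow callback shows 
up as queue-wait time for the tasks behind it rather than for itself. The 
signal does not depend on why the loop is slow: time spent blocked on a lock 
delays queued tasks just as much as time spent burning CPU.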

> Add metric for reactor load
> ---------------------------
>
>                 Key: KUDU-2144
>                 URL: https://issues.apache.org/jira/browse/KUDU-2144
>             Project: Kudu
>          Issue Type: Improvement
>          Components: metrics, ops-tooling
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Recently I was debugging a cluster that appeared to have network issues. Only 
> after lots of investigation did I realize that the reactor threads were not 
> keeping up with network traffic due to hitting KUDU-1964 (this cluster was 
> running 1.3.0). At first glance the reactors did not seem busy, since each 
> was only using ~25% of a CPU -- however, the other 75% of the time was spent 
> blocked on OpenSSL locks and not in epoll_wait as one would normally expect.
> This would be easier to diagnose if we had a metric showing the amount of 
> time the reactors spend idle (i.e. in epoll_wait) vs doing work (i.e. 
> executing callbacks, IO, etc). If any reactor is spending a high percentage 
> of time not in epoll, that suggests the reactors may be a bottleneck and 
> increasing latency or degrading throughput.


