[ 
https://issues.apache.org/jira/browse/KUDU-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236373#comment-15236373
 ] 

Todd Lipcon commented on KUDU-1410:
-----------------------------------

Some particular items I think we should implement:

h1. "Exemplar" traces

- Keep a couple of sample traces of each RPC type, bucketed by percentile. eg 
keep a random RPC that falls between the 25th and 75th percentiles, and another 
random RPC that is above the 95th percentile.

The idea here is that often RPCs are slow, but not slow enough to cause a 
timeout. This means we don't get the trace dumped in the logs. But, users may 
still think that the server is behaving slowly, and it would be useful to be 
able to go to a web page to get a dump with some recent slow requests.
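
A rough sketch of how such an exemplar store might look. All names here (Bucket, Classify, ExemplarStore, Exemplar) are hypothetical and not existing Kudu code. It assumes percentile rank is estimated against a small sliding window of recent latencies, and it picks the "random RPC" per bucket with size-one reservoir sampling; a real server would more likely consult an existing latency histogram and would need locking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <random>
#include <string>
#include <utility>

// Hypothetical bucket names for the two percentile ranges suggested above.
enum class Bucket { kNone, kTypical, kSlow };

// Maps a percentile rank in [0, 1] to a bucket: 25th-75th is "typical",
// above the 95th is "slow", everything else is not sampled.
inline Bucket Classify(double rank) {
  if (rank > 0.95) return Bucket::kSlow;
  if (rank >= 0.25 && rank <= 0.75) return Bucket::kTypical;
  return Bucket::kNone;
}

struct Exemplar {
  int64_t latency_us = 0;
  std::string trace_dump;
};

// Not thread-safe: a real server would need a lock or per-thread stores.
class ExemplarStore {
 public:
  explicit ExemplarStore(std::size_t window_size, uint32_t seed = 42)
      : window_size_(window_size), rng_(seed) {
    assert(window_size_ > 0);
  }

  // Called once per completed RPC with its latency and rendered trace.
  void Record(const std::string& method, int64_t latency_us,
              std::string trace_dump) {
    PerMethod& m = methods_[method];
    if (m.window.size() < window_size_) {
      // Still warming up: too few samples to estimate a percentile.
      m.window.push_back(latency_us);
      return;
    }
    // Rank = fraction of recent RPCs that were strictly faster than this one.
    std::size_t below = 0;
    for (int64_t v : m.window) {
      if (v < latency_us) ++below;
    }
    const double rank = static_cast<double>(below) / m.window.size();
    const Bucket b = Classify(rank);
    if (b != Bucket::kNone) {
      Slot& slot = (b == Bucket::kSlow) ? m.slow : m.typical;
      ++slot.seen;
      // Size-one reservoir: the n-th candidate replaces the kept one with
      // probability 1/n, so every candidate is equally likely to be kept.
      std::uniform_int_distribution<uint64_t> pick(1, slot.seen);
      if (pick(rng_) == 1) {
        slot.has_exemplar = true;
        slot.exemplar.latency_us = latency_us;
        slot.exemplar.trace_dump = std::move(trace_dump);
      }
    }
    m.window.pop_front();
    m.window.push_back(latency_us);
  }

  // Returns the kept exemplar for (method, bucket), or nullptr if none.
  const Exemplar* Get(const std::string& method, Bucket b) const {
    auto it = methods_.find(method);
    if (it == methods_.end() || b == Bucket::kNone) return nullptr;
    const Slot& s = (b == Bucket::kSlow) ? it->second.slow : it->second.typical;
    return s.has_exemplar ? &s.exemplar : nullptr;
  }

 private:
  struct Slot {
    uint64_t seen = 0;
    bool has_exemplar = false;
    Exemplar exemplar;
  };
  struct PerMethod {
    std::deque<int64_t> window;  // Recent latencies, oldest first.
    Slot typical;
    Slot slow;
  };

  std::size_t window_size_;
  std::mt19937 rng_;
  std::map<std::string, PerMethod> methods_;
};
```

A web page handler could then call Get("Write", Bucket::kSlow) and print the stored trace. Since the goal is recent slow requests, the reservoirs in this sketch would presumably need to be reset periodically; that part is left out.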

h1. Per-request metrics

- Add more accounting of resource usage and timing to the per-RPC traces, 
something like per-request metrics.

The idea here is that we often are judicious with what we are tracing with the 
TRACE() macro because it has a measurable overhead. For example, 
ac3771f4078c1f23545494f63384c724f73cc0af was an optimization that changed from 
tracing write-side "ProbeStats" per-op to tracing them once per batch. This 
resulted in an almost-20% speedup of a write benchmark. In that case, we used 
some ad-hoc code to aggregate the stats in the request.

However, there are lots of cases similar to this. For example, in a Scan RPC, 
we may access thousands of individual cfile blocks. It would be great to expose 
on a per-RPC basis metrics like the cache hit/miss count, the number of bytes 
read, the number of cfile reads which were slower than some threshold (eg 
indicating a seek), etc.
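
To make the Scan example concrete, here is a minimal sketch of a per-request counter set. RequestMetrics, BlockRead and AccountBlockRead are hypothetical names invented for illustration, not existing Kudu types, and the string-keyed map is only for readability; a real implementation would probably want cheaper keys. The point is that each block read costs a few counter increments rather than a formatted trace entry.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical per-request counter set, owned by one RPC.
class RequestMetrics {
 public:
  void Increment(const std::string& name, int64_t delta = 1) {
    counters_[name] += delta;
  }

  // Returns 0 for counters that were never incremented.
  int64_t Get(const std::string& name) const {
    auto it = counters_.find(name);
    return it == counters_.end() ? 0 : it->second;
  }

  // Renders "name=value" pairs in name order, e.g. for a trace dump.
  std::string ToString() const {
    std::string out;
    for (const auto& entry : counters_) {
      if (!out.empty()) out += ' ';
      out += entry.first + "=" + std::to_string(entry.second);
    }
    return out;
  }

 private:
  std::map<std::string, int64_t> counters_;
};

// Stand-in for the outcome of reading one cfile block.
struct BlockRead {
  bool cache_hit;
  int64_t bytes;
  int64_t micros;
};

// Folds one block read into the request's counters. Reads slower than the
// threshold are counted separately since they may indicate a seek.
inline void AccountBlockRead(const BlockRead& r, int64_t slow_threshold_us,
                             RequestMetrics* metrics) {
  assert(metrics != nullptr);
  metrics->Increment(r.cache_hit ? "cache_hits" : "cache_misses");
  metrics->Increment("bytes_read", r.bytes);
  if (r.micros > slow_threshold_us) metrics->Increment("slow_reads");
}
```

At the end of the RPC, the ToString() output could be appended once to the trace, so a scan touching thousands of blocks still produces a single summary line.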

One downside of the "just pass a statistics structure like ProbeStats through 
the call stack" approach is that it doesn't interface well with more generic 
pieces of code like spinlock contention. It would be useful for a slow RPC to 
include a count of how many cycles it spent blocked on spinlocks/mutexes, etc. 
This suggests that the counting should be somehow associated with a threadlocal 
(eg as part of a Trace* object or as its own structure).
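
A sketch of the thread-local variant, as its own structure rather than as part of Trace. RequestCounters, tls_current_counters, ScopedRequestCounters and AccountedLock are all hypothetical names. It uses std::mutex and wall-clock nanoseconds from std::chrono::steady_clock as a stand-in for counting cycles on Kudu's own spinlocks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical counters for the request currently running on this thread.
struct RequestCounters {
  int64_t lock_wait_ns = 0;
  int64_t lock_acquisitions = 0;
};

// Null whenever the thread is not servicing an accounted request.
thread_local RequestCounters* tls_current_counters = nullptr;

// RAII helper: the RPC handler installs its counters for the duration of the
// call, and the previous value is restored on exit so that scopes can nest.
class ScopedRequestCounters {
 public:
  explicit ScopedRequestCounters(RequestCounters* counters)
      : installed_(counters), prev_(tls_current_counters) {
    tls_current_counters = installed_;
  }
  ~ScopedRequestCounters() {
    assert(tls_current_counters == installed_);  // Scopes must nest properly.
    tls_current_counters = prev_;
  }
  ScopedRequestCounters(const ScopedRequestCounters&) = delete;
  ScopedRequestCounters& operator=(const ScopedRequestCounters&) = delete;

 private:
  RequestCounters* installed_;
  RequestCounters* prev_;
};

// Generic lock guard that knows nothing about RPCs. It only looks at the
// thread-local pointer, so no stats structure is passed down the call stack.
class AccountedLock {
 public:
  explicit AccountedLock(std::mutex& mu) : mu_(mu) {
    if (!mu_.try_lock()) {
      // Slow path: only pay for the clock reads when we actually block.
      const auto start = std::chrono::steady_clock::now();
      mu_.lock();
      const auto waited = std::chrono::steady_clock::now() - start;
      if (tls_current_counters != nullptr) {
        tls_current_counters->lock_wait_ns +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(waited).count();
      }
    }
    if (tls_current_counters != nullptr) {
      ++tls_current_counters->lock_acquisitions;
    }
  }
  ~AccountedLock() { mu_.unlock(); }
  AccountedLock(const AccountedLock&) = delete;
  AccountedLock& operator=(const AccountedLock&) = delete;

 private:
  std::mutex& mu_;
};
```

Threads that are not servicing a request leave the pointer null and pay only a null check. Work handed off to another thread would need the pointer propagated explicitly, which this sketch does not attempt.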

In combination with the above "exemplar traces", these stats should make it 
fairly easy to get a good idea of why a certain type of RPC is slow on a 
running server.



> Improve diagnosability of performance problems
> ----------------------------------------------
>
>                 Key: KUDU-1410
>                 URL: https://issues.apache.org/jira/browse/KUDU-1410
>             Project: Kudu
>          Issue Type: Bug
>          Components: supportability
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>
> Although Kudu has been relatively stable for most users, we are starting to 
> see more and more questions about performance. In internal test clusters 
> we're also struggling to understand performance issues or timeouts in some 
> cases from logs only, and it can require gathering a daemon trace to see 
> what's going on.
> This is an umbrella ticket for various improvements we can make so that 
> performance is easier to understand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
