[
https://issues.apache.org/jira/browse/KUDU-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236373#comment-15236373
]
Todd Lipcon commented on KUDU-1410:
-----------------------------------
Some particular items I think we should implement:
h1. "Exemplar" traces
- keep a couple of sample traces of each RPC type, bucketed by percentile. e.g.
keep a random RPC that falls between the 25th and 75th percentiles, and another
random RPC that is above the 95th percentile.
The idea here is that often RPCs are slow, but not slow enough to cause a
timeout. This means we don't get the trace dumped in the logs. But, users may
still think that the server is behaving slowly, and it would be useful to be
able to go to a web page to get a dump with some recent slow requests.
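A minimal sketch of how the exemplar bucketing above might work, assuming the caller can estimate the percentile of each finished RPC (for instance from an existing latency histogram). Every name here (ExemplarSampler, SampledTrace, Bucket) is hypothetical and not an existing Kudu API. It keeps one uniformly random trace per bucket using reservoir sampling with a reservoir of size one.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <string>
#include <utility>

// Hypothetical names throughout; none of these are Kudu APIs.
enum class Bucket { kTypical = 0, kSlow = 1, kNumBuckets = 2 };

struct SampledTrace {
  int64_t duration_us;
  std::string trace_dump;
};

// Keeps one uniformly random trace per bucket (reservoir sampling, k = 1).
class ExemplarSampler {
 public:
  explicit ExemplarSampler(uint32_t seed) : rng_(seed) {}

  // Maps a percentile in [0, 1] to a bucket: 25th-75th is "typical",
  // above the 95th is "slow", and everything else is not sampled.
  static std::optional<Bucket> Classify(double percentile) {
    if (percentile > 0.95) return Bucket::kSlow;
    if (percentile >= 0.25 && percentile <= 0.75) return Bucket::kTypical;
    return std::nullopt;
  }

  // 'percentile' is the caller's estimate of where this RPC's duration
  // falls in the latency distribution for its RPC type.
  // Returns true if the trace replaced the bucket's current exemplar.
  bool MaybeKeep(double percentile, SampledTrace trace) {
    assert(percentile >= 0.0 && percentile <= 1.0);
    std::optional<Bucket> b = Classify(percentile);
    if (!b) return false;
    const std::size_t i = static_cast<std::size_t>(*b);
    ++seen_[i];
    // Replace with probability 1/seen so that each trace seen so far is
    // equally likely to be the one retained. The first trace always wins.
    std::uniform_int_distribution<uint64_t> dist(1, seen_[i]);
    if (dist(rng_) != 1) return false;
    exemplars_[i] = std::move(trace);
    return true;
  }

  const std::optional<SampledTrace>& Get(Bucket b) const {
    return exemplars_[static_cast<std::size_t>(b)];
  }

 private:
  static constexpr std::size_t kN =
      static_cast<std::size_t>(Bucket::kNumBuckets);
  std::mt19937 rng_;
  std::array<uint64_t, kN> seen_{};
  std::array<std::optional<SampledTrace>, kN> exemplars_;
};
```

One sampler per RPC type would be needed, and this sketch is not thread-safe. To bias toward "recent" requests, the counters and exemplars could be reset on a timer; that part is left out here.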
h1. Per-request metrics
- add more accounting of resource usage/timing to the per-RPC traces --
something like per-request metrics
The idea here is that we are often judicious about what we trace with the
TRACE() macro because it has a measurable overhead. For example,
ac3771f4078c1f23545494f63384c724f73cc0af was an optimization that changed from
tracing write-side "ProbeStats" per-op to tracing them once per batch. This
resulted in an almost-20% speedup of a write benchmark. In that case, we used
some ad-hoc code to aggregate the stats within the request.
However, there are lots of cases similar to this. For example, in a Scan RPC,
we may access thousands of individual cfile blocks. It would be great to expose
on a per-RPC basis metrics like the cache hit/miss count, the number of bytes
read, the number of cfile reads which were slower than some threshold (e.g.
indicating a seek), etc.
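As a rough illustration of the Scan example, the counters could be accumulated in a small per-request structure and dumped once at the end of the RPC, rather than emitting a trace line per block. The struct name, its fields, and the 10 ms slow-read threshold are all assumptions made for this sketch.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical per-request counters for a Scan RPC; not a Kudu API.
struct RequestMetrics {
  // Assumed threshold above which a cfile read is counted as "slow"
  // (for example because it needed a disk seek).
  static constexpr int64_t kSlowReadThresholdUs = 10000;

  int64_t cache_hits = 0;
  int64_t cache_misses = 0;
  int64_t bytes_read = 0;
  int64_t slow_reads = 0;

  // Called once per cfile block access. Only increments integers, so it
  // is far cheaper than formatting a trace message for every block.
  void RecordBlockRead(bool cache_hit, int64_t bytes, int64_t elapsed_us) {
    if (cache_hit) {
      ++cache_hits;
    } else {
      ++cache_misses;
    }
    bytes_read += bytes;
    if (elapsed_us > kSlowReadThresholdUs) ++slow_reads;
  }

  // Rendered once per RPC, e.g. appended to the trace dump.
  std::string ToString() const {
    std::ostringstream out;
    out << "cache_hits=" << cache_hits
        << " cache_misses=" << cache_misses
        << " bytes_read=" << bytes_read
        << " slow_reads=" << slow_reads;
    return out.str();
  }
};
```

Here the structure still has to be handed to whatever code performs the reads.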
One downside of the "just pass a statistics structure like ProbeStats through
the call stack" approach is that it doesn't interface well with more generic
pieces of code like spinlock contention. It would be useful for a slow RPC to
include a count of how many cycles it spent blocked on spinlocks/mutexes, etc.
This suggests that the counting should somehow be associated with a thread-local
(e.g. as part of a Trace* object or as its own structure).
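A sketch of the thread-local idea, assuming the RPC handler attaches a stats object to its thread for the duration of the request so that generic code can find it without it being passed down the call stack. All names are hypothetical, and wall-clock nanoseconds are used as a portable stand-in for the cycle counts mentioned above.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical per-request stats that generic code can update.
struct ThreadRequestStats {
  int64_t lock_wait_ns = 0;
  int64_t lock_waits = 0;
};

// Stats for the request currently running on this thread, or nullptr
// when the thread is not servicing a request.
thread_local ThreadRequestStats* tls_current_stats = nullptr;

// RAII helper: attaches 'stats' to the current thread and restores the
// previous pointer on scope exit, so nested scopes behave correctly.
class ScopedAdoptStats {
 public:
  explicit ScopedAdoptStats(ThreadRequestStats* stats)
      : prev_(tls_current_stats) {
    tls_current_stats = stats;
  }
  ~ScopedAdoptStats() { tls_current_stats = prev_; }
  ScopedAdoptStats(const ScopedAdoptStats&) = delete;
  ScopedAdoptStats& operator=(const ScopedAdoptStats&) = delete;

 private:
  ThreadRequestStats* prev_;
};

// Called from generic code such as a lock slow path. It is a no-op when
// no request is attached to the calling thread.
inline void RecordLockWait(int64_t wait_ns) {
  if (ThreadRequestStats* s = tls_current_stats) {
    s->lock_wait_ns += wait_ns;
    ++s->lock_waits;
  }
}

// Example of generic lock code that knows nothing about RPCs. Works with
// any type providing try_lock() and lock(), such as std::mutex.
template <typename Lock>
void LockAndRecordWait(Lock& lock) {
  if (lock.try_lock()) return;  // Acquired without waiting: record nothing.
  const auto start = std::chrono::steady_clock::now();
  lock.lock();
  const auto waited = std::chrono::steady_clock::now() - start;
  RecordLockWait(
      std::chrono::duration_cast<std::chrono::nanoseconds>(waited).count());
}
```

An RPC handler would create a `ScopedAdoptStats` at the start of the request and include the accumulated totals in its trace afterwards. Work handed off to other threads would need the pointer propagated explicitly, which this sketch does not cover.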
In combination with the above "exemplar traces", these stats should make it
fairly easy to get a good idea of why a certain type of RPC is slow on a running
server.
> Improve diagnosability of performance problems
> ----------------------------------------------
>
> Key: KUDU-1410
> URL: https://issues.apache.org/jira/browse/KUDU-1410
> Project: Kudu
> Issue Type: Bug
> Components: supportability
> Affects Versions: 0.8.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
>
> Although Kudu has been relatively stable for most users, we are starting to
> see more and more questions about performance. In internal test clusters
> we're also struggling to understand performance issues or timeouts in some
> cases from logs only, and it can require gathering a daemon trace to see
> what's going on.
> This is an umbrella ticket for various improvements we can make so that
> performance is easier to understand.