Andrew Wong created KUDU-3060:
---------------------------------
Summary: Add a tool to identify potential performance bottlenecks
Key: KUDU-3060
URL: https://issues.apache.org/jira/browse/KUDU-3060
Project: Kudu
Issue Type: Improvement
Components: CLI, perf, ui
Reporter: Andrew Wong
When we hear users wondering why their workloads are slower than expected, some
common questions arise. It'd be great if we had a single tool (or a single
webpage) that aggregated and displayed useful information for a specific tablet
or table. For example, for a specific table:
- How many partitions and replicas exist for the table.
- For those replicas, how they are distributed across tablet servers.
- For those tablet servers, how the block cache is configured, and what the
current block cache stats (hit ratio, evictions, etc.) are.
- For those tablet servers, which tablets (of any table) have been written to
recently.
- For those tablet servers, which tablets within the target table have been
written to recently.
- For those tablet servers, how many active and non-expired scanners exist.
- For those tablet servers, which tablets within the target table have been
read from recently.
- For those tablet servers, how many ongoing tablet copies there are both to
and from the server.
- For those tablet servers, how many data directories there are.
- For the data directories on those tablet servers, how many replicas store
data in each directory, how many blocks there are in each, and how much space
is available in each.
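As a rough illustration of the aggregation described in the list above, here is
a minimal Python sketch that counts replicas per tablet server and per data
directory. The Replica record, its field names, and the sample values are all
hypothetical; nothing here reflects an existing Kudu API or tool output.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Replica:
    # Hypothetical record; field names are illustrative, not Kudu's API.
    tablet_id: str
    tserver: str
    data_dirs: Tuple[str, ...]  # directories this replica stores data in


def summarize_replicas(replicas):
    """Count replicas per tablet server and per data directory on each server."""
    per_tserver = Counter(r.tserver for r in replicas)
    per_dir = defaultdict(Counter)
    for r in replicas:
        for d in r.data_dirs:
            per_dir[r.tserver][d] += 1
    return dict(per_tserver), {ts: dict(c) for ts, c in per_dir.items()}


# Made-up sample data for a single table.
replicas = [
    Replica("t1", "ts-a", ("/data/1", "/data/2")),
    Replica("t2", "ts-a", ("/data/2",)),
    Replica("t3", "ts-a", ("/data/1",)),
    Replica("t1", "ts-b", ("/data/1",)),
]
per_tserver, per_dir = summarize_replicas(replicas)
print(per_tserver)
print(per_dir)
```

A real tool would presumably gather these records from the masters and tablet
servers; the sketch only shows the shape of the per-server and per-directory
rollup.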
The list could go on and on. It probably makes sense to break the diagnostics
into different phases or goals, maybe along the lines of 1) identifying
workload hotspots and lag across tablet servers (e.g. a ton of writes going
to a single tserver), and 2) digging into a single tablet server to understand
how it's provisioned and whether that provisioning is sufficient.
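For the first phase, one possible heuristic (an assumption for illustration,
not something proposed in this issue) is to flag any tablet server whose recent
write count is well above the mean across servers. The function name, the
threshold factor, and the sample counts below are all hypothetical.

```python
def find_hotspots(writes_per_tserver, factor=2.0):
    """Return tservers whose recent write count exceeds `factor` times the mean."""
    if not writes_per_tserver:
        return []
    mean = sum(writes_per_tserver.values()) / len(writes_per_tserver)
    if mean <= 0:
        return []
    return sorted(
        ts for ts, n in writes_per_tserver.items() if n > factor * mean
    )


# Made-up recent write counts per tablet server.
writes = {"ts-a": 9000, "ts-b": 400, "ts-c": 350, "ts-d": 250}
hotspots = find_hotspots(writes)
print(hotspots)  # ['ts-a']
```

A mean-based threshold is crude and skews when one server dominates; a median
or percentile cutoff may be a better fit in practice.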