[
https://issues.apache.org/jira/browse/CASSANDRA-8720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718911#comment-17718911
]
Andres de la Peña commented on CASSANDRA-8720:
----------------------------------------------
Thanks :)
The data in the table comes from
[{{EstimatedHistogram}}|https://github.com/apache/cassandra/blob/d2923275e360a1ee9db498e748c269f701bb3a8b/src/java/org/apache/cassandra/utils/EstimatedHistogram.java]s,
like the ones we use for metrics such as those in
[{{MetadataCollector}}|https://github.com/apache/cassandra/blob/b7e1e44a909c3a1d11e9c387db680c74d31b879f/src/java/org/apache/cassandra/io/sstable/metadata/MetadataCollector.java#L60-L71].
Histograms do sampling and they don't provide entirely accurate results.
However, it would be easy to track the min and max metrics so they are exact.
Indeed, max seems particularly important when trying to detect large
partitions.
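To illustrate why the exact min/max is cheap to keep next to an approximate histogram, here is a minimal sketch. It is not the code in the patch: {{ExactMinMax}} and {{bucketUpperBound}} are hypothetical names, and the power-of-two bucketing is a simplified stand-in for a bucketed histogram, not the actual bucket layout of {{EstimatedHistogram}}.

```java
// Hypothetical helper, not Cassandra code: keeps the exact extremes of the
// values it sees, independently of any histogram used for percentiles.
final class ExactMinMax
{
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private long count = 0;

    void add(long value)
    {
        min = Math.min(min, value);
        max = Math.max(max, value);
        count++;
    }

    long min()
    {
        if (count == 0)
            throw new IllegalStateException("no values added");
        return min;
    }

    long max()
    {
        if (count == 0)
            throw new IllegalStateException("no values added");
        return max;
    }

    long count()
    {
        return count;
    }

    // Simplified stand-in for a bucketed histogram (assumption, not the real
    // EstimatedHistogram layout): a value is only known up to the upper bound
    // of its bucket, here the smallest power of two >= value, for value >= 1.
    static long bucketUpperBound(long value)
    {
        long bound = 1;
        while (bound < value)
            bound <<= 1;
        return bound;
    }
}
```

Feeding the row counts 310706, 100 and 4900 into {{ExactMinMax}} gives back exactly 100 and 310706, whereas the bucketed view of a value such as 100000 only knows its bucket bound. That is the reason the percentiles carry a tilde and min/max do not.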
I have updated the patch to do that exact calculation of min/max. I have also
added a tilde (~) prefix to the percentiles coming from the histogram in an
attempt to indicate that they are approximate. So the output of the tool now
looks like:
{code:java}
> sstablepartitions data/data/k/t-d7be5e90e90111ed8b54efe3c39cb0bb/nc-8-big-Data.db --min-size 100MiB
Processing #8 (big-nc) (1.368 GiB uncompressed, 534.979 MiB on disk)
Partition: '13' (0000000d) live, size: 105.056 MiB, rows: 91490, cells: 274470, tombstones: 50 (row:50, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '1' (00000001) live, size: 127.241 MiB, rows: 111065, cells: 333195, tombstones: 50 (row:50, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '8' (00000008) live, size: 356.067 MiB, rows: 310706, cells: 932118, tombstones: 0 (row:0, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '2' (00000002) live, size: 213.341 MiB, rows: 186582, cells: 559125, tombstones: 978 (row:978, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Summary of #8 (big-nc):
File: /Users/adelapena/Desktop/sstablepartitions/nc-8-big-Data.db
4 partitions match
Keys: 13 1 8 2
       Partition size  Row count  Cell count  Tombstone count
~p50      767.519 KiB        770        1916                0
~p75        2.238 MiB       2299        5722                0
~p90        3.867 MiB       3311        9887               50
~p95       16.629 MiB      14237       42510              446
~p99      148.267 MiB     126934      379022             1331
~p999     368.936 MiB     315852      943127             2759
min        56.854 KiB        100         150                0
max       356.067 MiB     310706      932118             2450
count 210
{code}
Note also that the min/max rows in the table don't represent a single
partition. In the previous example the max size of 356.067 MiB comes from
partition '8', whereas the max number of tombstones (2450) comes from another
partition. That partition is not listed because its size is below the 100MiB
threshold. We can find the key of that partition if we add a {{--min-tombstones}}
threshold to the command:
{code:java}
> sstablepartitions data/data/k/t-d7be5e90e90111ed8b54efe3c39cb0bb/nc-8-big-Data.db --min-size 100MiB --min-tombstones 2000
Processing #8 (big-nc) (1.368 GiB uncompressed, 534.979 MiB on disk)
Partition: '13' (0000000d) live, size: 105.056 MiB, rows: 91490, cells: 274470, tombstones: 50 (row:50, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '1' (00000001) live, size: 127.241 MiB, rows: 111065, cells: 333195, tombstones: 50 (row:50, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '8' (00000008) live, size: 356.067 MiB, rows: 310706, cells: 932118, tombstones: 0 (row:0, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '2' (00000002) live, size: 213.341 MiB, rows: 186582, cells: 559125, tombstones: 978 (row:978, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Partition: '21' (00000015) live, size: 3.853 MiB, rows: 4900, cells: 9927, tombstones: 2450 (row:2450, range:0, complex:0, cell:0, row-TTLd:0, cell-TTLd:0)
Summary of #8 (big-nc):
File: /Users/adelapena/Desktop/sstablepartitions/nc-8-big-Data.db
5 partitions match
Keys: 13 1 8 2 21
       Partition size  Row count  Cell count  Tombstone count
~p50      767.519 KiB        770        1916                0
~p75        2.238 MiB       2299        5722                0
~p90        3.867 MiB       3311        9887               50
~p95       16.629 MiB      14237       42510              446
~p99      148.267 MiB     126934      379022             1331
~p999     368.936 MiB     315852      943127             2759
min        56.854 KiB        100         150                0
max       356.067 MiB     310706      932118             2450
count 210
{code}
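The point that each column's max can come from a different partition can be sketched as follows, using the figures for partitions '8', '2' and '21' from the output above. This is only an illustration, not the tool's code: {{PartitionStats}} and {{SummaryMax}} are hypothetical names. The thresholds are combined with OR because that is what the output suggests (partition '21' is listed despite being under 100MiB), and using {{>=}} for the comparison is an assumption.

```java
// Hypothetical per-partition record, holding only the fields needed here.
final class PartitionStats
{
    final String key;
    final double sizeMiB;
    final long tombstones;

    PartitionStats(String key, double sizeMiB, long tombstones)
    {
        this.key = key;
        this.sizeMiB = sizeMiB;
        this.tombstones = tombstones;
    }
}

// Hypothetical helper: each maximum is computed per column, so the keys
// behind "max size" and "max tombstones" need not be the same partition.
final class SummaryMax
{
    static String keyOfLargest(PartitionStats[] partitions)
    {
        requireNonEmpty(partitions);
        PartitionStats best = partitions[0];
        for (PartitionStats p : partitions)
            if (p.sizeMiB > best.sizeMiB)
                best = p;
        return best.key;
    }

    static String keyOfMostTombstones(PartitionStats[] partitions)
    {
        requireNonEmpty(partitions);
        PartitionStats best = partitions[0];
        for (PartitionStats p : partitions)
            if (p.tombstones > best.tombstones)
                best = p;
        return best.key;
    }

    // Assumption: a partition is listed when it reaches ANY of the thresholds.
    static boolean matches(PartitionStats p, double minSizeMiB, long minTombstones)
    {
        return p.sizeMiB >= minSizeMiB || p.tombstones >= minTombstones;
    }

    private static void requireNonEmpty(PartitionStats[] partitions)
    {
        if (partitions == null || partitions.length == 0)
            throw new IllegalArgumentException("at least one partition is required");
    }
}
```

With the three partitions above, {{keyOfLargest}} returns '8' and {{keyOfMostTombstones}} returns '21'. Under these assumptions partition '21' only matches once a tombstone threshold of 2000 is added to the 100MiB size threshold.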
> Provide tools for finding wide row/partition keys
> -------------------------------------------------
>
> Key: CASSANDRA-8720
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8720
> Project: Cassandra
> Issue Type: Improvement
> Components: Legacy/Tools
> Reporter: J.B. Langston
> Assignee: Andres de la Peña
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 8720.txt
>
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Multiple users have requested some sort of tool to help identify wide row
> keys. They get into a situation where they know a wide row/partition has been
> inserted and it's causing problems for them but they have no idea what the
> row key is in order to remove it.
> Maintaining the widest row key currently encountered and displaying it in
> cfstats would be one possible approach.
> Another would be an offline tool (possibly an enhancement to sstablekeys) to
> show the number of columns/bytes per key in each sstable. If a tool to
> aggregate the information at a CF-level could be provided that would be a
> bonus, but it shouldn't be too hard to write a script wrapper to aggregate
> them if not.