[ 
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747786#comment-15747786
 ] 

Sylvain Lebresne commented on CASSANDRA-13038:
----------------------------------------------

bq. What about this patch? Basically it also doesn't update the tombstone 
histogram for expiring cells.

I admit I don't follow the reasoning here. You can't remove functionality just 
because it's inconvenient to you performance-wise. We *want* to include 
expiring columns in {{estimatedTombstoneDropTime}} because that's what allows 
us to proactively compact sstables that mostly consist of expired (and 
gc_able) data. Expiring columns are in fact the main reason for 
{{estimatedTombstoneDropTime}} to exist, as the optimization it's used for 
kicks in much more frequently in the case of lots of expired data than in the 
case of pure tombstones (you need a pretty special workload to regularly end 
up with sstables composed almost entirely of pure tombstones).

I think the right way to do something about this is to improve 
{{StreamingHistogram}} (or at least the way we use it), and looking at its 
code, it does look pretty inefficient for what we use it for here.

In practice, the {{localDeletionTime}} for cells in a sstable is likely to vary 
a lot: if you have a default TTL on your table, the {{localDeletionTime}} for a 
cell is {{<insertion time> + ttl + gc_grace}}, so every cell could well have a 
different value. That means we can assume most calls to 
{{StreamingHistogram.update()}} will see a new value, and so we'll constantly 
hit {{maxBinSize}} and take the slow path of that method.
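To make that slow path concrete, here is a tiny self-contained sketch of the 
general streaming-histogram idea (this is *not* the actual 
{{StreamingHistogram}} code, just an illustration of its behaviour): once the 
bin map is full, every call with a previously unseen deletion time has to scan 
all the bins to find and merge the two closest ones.
{code}
import java.util.TreeMap;

// Toy illustration only, not Cassandra's StreamingHistogram: shows why a stream
// of mostly-distinct keys is the worst case once maxBinSize bins already exist.
class ToyStreamingHistogram
{
    private final int maxBinSize;
    private final TreeMap<Long, Long> bins = new TreeMap<>(); // deletion time -> count

    ToyStreamingHistogram(int maxBinSize)
    {
        this.maxBinSize = maxBinSize;
    }

    void update(long key)
    {
        bins.merge(key, 1L, Long::sum);
        if (bins.size() > maxBinSize)
            mergeClosestBins(); // slow path: a full scan of the bins, every time
    }

    private void mergeClosestBins()
    {
        Long left = null, prev = null;
        long bestGap = Long.MAX_VALUE;
        for (long k : bins.keySet())
        {
            if (prev != null && k - prev < bestGap)
            {
                bestGap = k - prev;
                left = prev;
            }
            prev = k;
        }
        long right = bins.higherKey(left);
        long count = bins.remove(left) + bins.remove(right);
        bins.merge((left + right) / 2, count, Long::sum); // simplistic midpoint merge
    }
}
{code}
With mostly-distinct {{localDeletionTime}} values, the size check above trips 
on essentially every call after the first {{maxBinSize}} ones, so (at least in 
this toy) the per-cell cost becomes linear in {{maxBinSize}}.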

But in practice we use it to estimate whether a sstable has more than a certain 
ratio of droppable data, in order to trigger a special tombstone-removal 
compaction, so we can likely afford some imprecision. So a dead simple idea 
could be to round the {{localDeletionTime}} to, say, the nearest hour before 
adding it to the histogram. Losing some precision to the hour is unlikely to 
matter for the optimization we're talking about, but it would greatly reduce 
how often we hit the slow path of {{update}} in most cases.
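Concretely, that rounding would be a one-liner at the point where we feed the 
histogram; something along these lines (the method and call site here are just 
illustrative, not the actual collector code):
{code}
// Illustrative sketch: bucket deletion times by the hour before they reach the
// histogram, so all cells expiring within the same hour share a single bin.
private static final int ROUNDING_SECONDS = 3600; // 1 hour

static int roundedDeletionTime(int localDeletionTime)
{
    // localDeletionTime is in seconds; rounding down to the hour is plenty of
    // precision for the droppable-data estimate.
    return localDeletionTime - (localDeletionTime % ROUNDING_SECONDS);
}

// ...and wherever we currently do something like
//     estimatedTombstoneDropTime.update(localDeletionTime);
// we would do
//     estimatedTombstoneDropTime.update(roundedDeletionTime(localDeletionTime));
{code}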

Or we can probably do a lot better with a little more involvement. For 
instance, it's not very hard to know upfront (at the time we create the 
{{MetadataCollector}}) the min/max {{localDeletionTime}} we're going to see: we 
can track that easily at the memtable level for flushes (in fact we already 
track the min in {{EncodingStats}}), and we can use the stats of the compacted 
sstables on compaction. So we could take that amplitude, divide it by however 
many buckets we're willing to afford (the equivalent of the current 
{{maxBinSize}}), and have a simple array of buckets. Updating would be some 
trivial math and an array access, and that's it. Of course, it would be less 
precise than the current approach, especially when the concrete deletion times 
are not well distributed over the amplitude we consider, but I don't think that 
matters for this use case.
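As a rough sketch of that second idea (names and details are illustrative; 
serialization, merging and degenerate ranges are ignored):
{code}
// Illustrative fixed-bucket replacement: the [min, max] localDeletionTime range
// is known when the collector is created, so update() is just arithmetic plus
// an array increment -- no merging, no per-call scan.
class FixedBucketDropTimeHistogram
{
    private final long min;
    private final long bucketWidth;
    private final long[] counts;

    FixedBucketDropTimeHistogram(long minDeletionTime, long maxDeletionTime, int bucketCount)
    {
        this.min = minDeletionTime;
        // at least 1 second per bucket so the division below is well defined
        this.bucketWidth = Math.max(1, (maxDeletionTime - minDeletionTime + 1) / bucketCount);
        this.counts = new long[bucketCount];
    }

    void update(long localDeletionTime)
    {
        // clamp to be safe at the edges of the range
        int bucket = (int) Math.min(counts.length - 1,
                                    Math.max(0, (localDeletionTime - min) / bucketWidth));
        counts[bucket]++;
    }

    // Rough (conservative) equivalent of what the histogram is used for today:
    // how many recorded deletion times fall in buckets entirely before gcBefore.
    long droppableBefore(long gcBefore)
    {
        long sum = 0;
        for (int i = 0; i < counts.length; i++)
            if (min + (i + 1) * bucketWidth - 1 <= gcBefore)
                sum += counts[i];
        return sum;
    }
}
{code}
The precision loss versus the streaming histogram only shows up when deletion 
times cluster in a small part of the range, and even then it only shifts when 
the tombstone-removal compaction kicks in, which is the kind of imprecision we 
can afford here.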


> 33% of compaction time spent in StreamingHistogram.update()
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-13038
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Corentin Chary
>         Attachments: compaction-streaminghistrogram.png, 
> profiler-snapshot.nps, tombstone-histograms-expiring.patch
>
>
> With the following table, that contains a *lot* of cells: 
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
>     metric uuid,
>     time_start_ms bigint,
>     offset smallint,
>     count int,
>     value double,
>     PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
> AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 
> 'compaction_window_size': '6', 'compaction_window_unit': 'HOURS', 
> 'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
>         Read Count: 1822
>         Read Latency: 1.8870054884742042 ms.
>         Write Count: 2212271647
>         Write Latency: 0.027705127678653473 ms.
>         Pending Flushes: 0
>                 Table: datapoints_11520p_60s
>                 SSTable count: 47
>                 Space used (live): 300417555945
>                 Space used (total): 303147395017
>                 Space used by snapshots (total): 0
>                 Off heap memory used (total): 207453042
>                 SSTable Compression Ratio: 0.4955200053039823
>                 Number of keys (estimate): 16343723
>                 Memtable cell count: 220576
>                 Memtable data size: 17115128
>                 Memtable off heap memory used: 0
>                 Memtable switch count: 2872
>                 Local read count: 0
>                 Local read latency: NaN ms
>                 Local write count: 1103167888
>                 Local write latency: 0.025 ms
>                 Pending flushes: 0
>                 Percent repaired: 0.0
>                 Bloom filter false positives: 0
>                 Bloom filter false ratio: 0.00000
>                 Bloom filter space used: 105118296
>                 Bloom filter off heap memory used: 106547192
>                 Index summary off heap memory used: 27730962
>                 Compression metadata off heap memory used: 73174888
>                 Compacted partition minimum bytes: 61
>                 Compacted partition maximum bytes: 51012
>                 Compacted partition mean bytes: 7899
>                 Average live cells per slice (last five minutes): NaN
>                 Maximum live cells per slice (last five minutes): 0
>                 Average tombstones per slice (last five minutes): NaN
>                 Maximum tombstones per slice (last five minutes): 0
>                 Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in 
> StreamingHistogram.update() (which is used to store the estimated tombstone 
> drop times).
> This could be caused by a huge number of different deletion times, which would 
> make the bins huge, but this histogram should be capped at 100 keys. It's 
> more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the 
> fact that this table uses TWCS also gives us an additional hint that sampling 
> deletion times would be fine.



