[
https://issues.apache.org/jira/browse/CASSANDRA-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857854#comment-15857854
]
Sylvain Lebresne commented on CASSANDRA-13038:
----------------------------------------------
Fwiw, I agree that making {{StreamingHistogram}} more efficient is needed and
should be done, but I also have _strong_ doubts about [~jjirsa]'s assertion
that rounding up to the next hour "is not reasonable for the general
population" _given_ where and why that value is used, and I'd be really curious
to see a use case that isn't very special where doing so would have a
noticeably bad impact.
In fact, _even_ once we make {{StreamingHistogram}} more efficient, it will
still have more work to do if we keep second precision than if we round it up
a bit (it doesn't have to be to the hour, it could even be just to 5-10
minutes), and since I really doubt we need second precision here, not doing at
least a bit of rounding up feels like a waste of perfectly good CPU cycles.
Again, let me stress that with a default table TTL and a decently loaded time
series workload (both of which are not uncommon), you are fairly likely to see
a {{localDeletionTime}} for almost every second of whatever time span your
sstable covers. But in what world do users care about an sstable being dropped
at the exact second at which the last data it contains becomes gcable (versus
being perfectly happy with it happening within some reasonably short time
window)? Keeping in mind that even if users do care about such crazy
precision, we're actually failing them already: despite keeping that precision
as input to {{StreamingHistogram}}, 1) {{StreamingHistogram}} has a relatively
small maximum number of bins, so it loses precision on its own, and 2) the
current compaction code isn't even _checking_ for sstables to drop every
second in the first place (in all fairness, we could be smarter here, but...).
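To illustrate what I mean by the histogram losing precision on its own, here is a
minimal sketch of the usual merge-on-overflow behaviour of a streaming histogram
(the class name, the details and the bin cap below are illustrative, not the actual
{{StreamingHistogram}} code): once the bin cap is reached, every new distinct
deletion time triggers a scan for the two closest bins and a merge, which is also
where the CPU time described in this ticket goes.
{code}
import java.util.TreeMap;

// Illustrative sketch only, not the Cassandra implementation.
class TinyStreamingHistogram
{
    private final int maxBins;                                  // e.g. 100 for sstable metadata
    private final TreeMap<Double, Long> bins = new TreeMap<>(); // position -> count

    TinyStreamingHistogram(int maxBins)
    {
        this.maxBins = maxBins;
    }

    void update(double point)
    {
        bins.merge(point, 1L, Long::sum);
        if (bins.size() <= maxBins)
            return;

        // Over the cap: find the two bins closest to each other and merge them
        // into a single weighted-average bin. With second-precision deletion
        // times, nearly every update ends up doing this full scan.
        Double prev = null;
        double left = 0, right = 0, smallestGap = Double.MAX_VALUE;
        for (Double key : bins.keySet())
        {
            if (prev != null && key - prev < smallestGap)
            {
                smallestGap = key - prev;
                left = prev;
                right = key;
            }
            prev = key;
        }
        long leftCount = bins.remove(left);
        long rightCount = bins.remove(right);
        double merged = (left * leftCount + right * rightCount) / (leftCount + rightCount);
        bins.merge(merged, leftCount + rightCount, Long::sum);
    }
}
{code}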
Anyway, all this to say that I personally have no time in the near future to
rewrite {{StreamingHistogram}} more efficiently, Corentin said he doesn't
either, and no one else has stepped up in the last month, so I'd be fine with
(and in favor of, really) doing some rounding up of the time we pass as input
to {{StreamingHistogram}}. And rounding up to the hour is arguably a bit of a
big hammer, so I'd actually suggest something along the lines of 5-10 minutes
(which should still improve the situation described on this ticket
substantially).
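To be concrete about the suggestion, something along these lines is all I have in
mind (purely illustrative, not a patch; the names and the 10-minute constant are
mine):
{code}
// Illustrative sketch: round each localDeletionTime *up* to the next 10-minute
// boundary before it is fed to StreamingHistogram, so a heavily written time
// series table produces a handful of distinct inputs per window instead of one
// per second. Rounding up (rather than down) means we never consider data
// droppable earlier than it actually is.
private static final int ROUND_SECONDS = 600; // 10 minutes; 5 minutes would work too

static int roundedDeletionTime(int localDeletionTime)
{
    return ((localDeletionTime + ROUND_SECONDS - 1) / ROUND_SECONDS) * ROUND_SECONDS;
}
{code}
The on-disk format and the histogram itself would stay untouched; only the value
we pass as input changes.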
We'll obviously still want to have a ticket to improve the implementation
(though as said above, I'd be in favor of _keeping_ the rounding up even then)
but that would at least make it less urgent.
With all that said, [~jjirsa], you have expressed some strong opposition to
rounding up above (though maybe a smaller 10-minute rounding is more
acceptable?), and maybe that opposition was shared by others, so I'll just
leave this as "here's what I would do and why" for now.
> 33% of compaction time spent in StreamingHistogram.update()
> -----------------------------------------------------------
>
> Key: CASSANDRA-13038
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13038
> Project: Cassandra
> Issue Type: Bug
> Components: Compaction
> Reporter: Corentin Chary
> Assignee: Corentin Chary
> Attachments: compaction-speedup.patch,
> compaction-streaminghistrogram.png, profiler-snapshot.nps
>
>
> With the following table, that contains a *lot* of cells:
> {code}
> CREATE TABLE biggraphite.datapoints_11520p_60s (
>     metric uuid,
>     time_start_ms bigint,
>     offset smallint,
>     count int,
>     value double,
>     PRIMARY KEY ((metric, time_start_ms), offset)
> ) WITH CLUSTERING ORDER BY (offset DESC)
>   AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
>                     'compaction_window_size': '6', 'compaction_window_unit': 'HOURS',
>                     'max_threshold': '32', 'min_threshold': '6'};
> Keyspace : biggraphite
> Read Count: 1822
> Read Latency: 1.8870054884742042 ms.
> Write Count: 2212271647
> Write Latency: 0.027705127678653473 ms.
> Pending Flushes: 0
> Table: datapoints_11520p_60s
> SSTable count: 47
> Space used (live): 300417555945
> Space used (total): 303147395017
> Space used by snapshots (total): 0
> Off heap memory used (total): 207453042
> SSTable Compression Ratio: 0.4955200053039823
> Number of keys (estimate): 16343723
> Memtable cell count: 220576
> Memtable data size: 17115128
> Memtable off heap memory used: 0
> Memtable switch count: 2872
> Local read count: 0
> Local read latency: NaN ms
> Local write count: 1103167888
> Local write latency: 0.025 ms
> Pending flushes: 0
> Percent repaired: 0.0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 105118296
> Bloom filter off heap memory used: 106547192
> Index summary off heap memory used: 27730962
> Compression metadata off heap memory used: 73174888
> Compacted partition minimum bytes: 61
> Compacted partition maximum bytes: 51012
> Compacted partition mean bytes: 7899
> Average live cells per slice (last five minutes): NaN
> Maximum live cells per slice (last five minutes): 0
> Average tombstones per slice (last five minutes): NaN
> Maximum tombstones per slice (last five minutes): 0
> Dropped Mutations: 0
> {code}
> It looks like a good chunk of the compaction time is lost in
> StreamingHistogram.update() (which is used to store the estimated tombstone
> drop times).
> This could be caused by a huge number of different deletion times, which would
> make the bins huge, but this histogram should be capped to 100 keys. It's
> more likely caused by the huge number of cells.
> A simple solution could be to only take into account part of the cells; the
> fact that this table uses TWCS also gives us an additional hint that sampling
> deletion times would be fine.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)