[
https://issues.apache.org/jira/browse/CASSANDRA-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Branimir Lambov updated CASSANDRA-18123:
----------------------------------------
Since Version: 3.0.0
> Reuse of metadata collector can break key count calculation
> -----------------------------------------------------------
>
> Key: CASSANDRA-18123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18123
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction
> Reporter: Branimir Lambov
> Priority: Normal
>
> When flushing a memtable, we currently pass a single pre-constructed
> {{MetadataCollector}} to the {{SSTableMultiWriter}} that writes the sstables.
> The writer may decide to split the data into multiple sstables (e.g. for
> separate disks, or because the compaction strategy requests it). When it does,
> the cardinality estimation component of the reused {{MetadataCollector}} ends
> up containing the keys of all of the sstables, so each individual sstable's
> metadata reports the combined key count instead of its own.
> As a result, when such sstables are compacted, the estimated number of keys
> in the resulting sstables, which is used to size the bloom filter of the
> compaction result, is heavily inflated (illustrated by the sketch at the end
> of this description).
> This produces L1 bloom filters that are much bigger than they should be. One
> example (observed during testing of the upcoming CEP-26, after inserting
> 100GB of data with 10% reads):
> (current)
> {code}
> Bloom filter false positives: 22627369
> Bloom filter false ratio: 0.02257
> Bloom filter space used: 1848247864
> Bloom filter off heap memory used: 2338964088
> {code}
> (fixed)
> {code}
> Bloom filter false positives: 24426545
> Bloom filter false ratio: 0.02429
> Bloom filter space used: 1118910096
> Bloom filter off heap memory used: 1532357432
> {code}
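> The effect can be reproduced outside Cassandra. The sketch below is not
> Cassandra code: {{KeyCountEstimator}} is only a stand-in for the cardinality
> estimation component inside {{MetadataCollector}}, and a textbook bloom
> filter sizing formula stands in for Cassandra's actual calculation. It only
> demonstrates how reusing one estimator across the writers of a multi-writer
> inflates every per-sstable key estimate and, with it, the bloom filter size.
> {code}
> import java.util.HashSet;
> import java.util.Set;
>
> public class SharedEstimatorDemo
> {
>     /** Stand-in for the cardinality estimation component of the metadata collector. */
>     static class KeyCountEstimator
>     {
>         private final Set<Long> keys = new HashSet<>();
>
>         void add(long key)
>         {
>             keys.add(key);
>         }
>
>         long estimate()
>         {
>             return keys.size();
>         }
>     }
>
>     /** Textbook bloom filter sizing: bits = -n * ln(p) / (ln 2)^2. */
>     static long bloomFilterBits(long keyCount, double falsePositiveRate)
>     {
>         return (long) Math.ceil(-keyCount * Math.log(falsePositiveRate)
>                                 / (Math.log(2) * Math.log(2)));
>     }
>
>     public static void main(String[] args)
>     {
>         int sstables = 8;             // e.g. one sstable per disk
>         int keysPerSSTable = 100_000;
>
>         // Buggy path: one estimator is reused by every writer of the multi-writer,
>         // so it accumulates the keys of all sstables produced by the flush.
>         KeyCountEstimator shared = new KeyCountEstimator();
>         long key = 0;
>         for (int i = 0; i < sstables; i++)
>             for (int k = 0; k < keysPerSSTable; k++)
>                 shared.add(key++);
>         long sharedEstimate = shared.estimate();      // reported for EACH sstable
>
>         // Fixed path: each sstable gets its own estimator.
>         KeyCountEstimator own = new KeyCountEstimator();
>         for (int k = 0; k < keysPerSSTable; k++)
>             own.add(k);
>         long ownEstimate = own.estimate();
>
>         System.out.printf("per-sstable key estimate, shared collector:   %,d%n", sharedEstimate);
>         System.out.printf("per-sstable key estimate, separate collector: %,d%n", ownEstimate);
>         System.out.printf("bloom filter bits at 1%% fp, shared:   %,d%n",
>                           bloomFilterBits(sharedEstimate, 0.01));
>         System.out.printf("bloom filter bits at 1%% fp, separate: %,d%n",
>                           bloomFilterBits(ownEstimate, 0.01));
>     }
> }
> {code}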