Branimir Lambov created CASSANDRA-18123:
-------------------------------------------
Summary: Reuse of metadata collector can break key count calculation
Key: CASSANDRA-18123
URL: https://issues.apache.org/jira/browse/CASSANDRA-18123
Project: Cassandra
Issue Type: Bug
Components: Local/Compaction
Reporter: Branimir Lambov
When flushing a memtable we currently pass a single constructed {{MetadataCollector}}
to the {{SSTableMultiWriter}} that writes the sstables. The writer may decide to
split the data into multiple sstables (e.g. for separate disks, or driven by the
compaction strategy). If it does so, the same {{MetadataCollector}} is reused for
every sstable, and the cardinality estimation component recorded for each
individual sstable contains the data for all of them.
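The effect can be illustrated with a minimal sketch. This is not Cassandra code: {{ToyCollector}} is a hypothetical stand-in for {{MetadataCollector}}, an exact set replaces the probabilistic cardinality estimator, and the sketch assumes the per-sstable metadata is read after all outputs have been written.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SharedCollectorSketch {
    // Hypothetical stand-in for a metadata collector. An exact set is used
    // instead of a probabilistic estimator to keep the sketch simple.
    static final class ToyCollector {
        private final Set<String> keys = new HashSet<>();
        void addKey(String key) { keys.add(key); }
        long keyEstimate() { return keys.size(); }
    }

    // One collector shared by every output sstable: each sstable ends up
    // reporting the key count of the whole flush.
    static long[] flushWithSharedCollector(List<List<String>> outputs) {
        ToyCollector shared = new ToyCollector();
        for (List<String> sstableKeys : outputs)
            for (String key : sstableKeys)
                shared.addKey(key);
        long[] estimates = new long[outputs.size()];
        Arrays.fill(estimates, shared.keyEstimate());
        return estimates;
    }

    // A fresh collector per output sstable: each sstable reports only
    // the keys it actually contains.
    static long[] flushWithFreshCollectors(List<List<String>> outputs) {
        long[] estimates = new long[outputs.size()];
        for (int i = 0; i < outputs.size(); i++) {
            ToyCollector own = new ToyCollector();
            for (String key : outputs.get(i))
                own.addKey(key);
            estimates[i] = own.keyEstimate();
        }
        return estimates;
    }

    public static void main(String[] args) {
        // A flush split into two sstables holding 3 and 2 keys.
        List<List<String>> outputs =
            List.of(List.of("k1", "k2", "k3"), List.of("k4", "k5"));
        System.out.println(Arrays.toString(flushWithSharedCollector(outputs))); // [5, 5]
        System.out.println(Arrays.toString(flushWithFreshCollectors(outputs))); // [3, 2]
    }
}
```

With the shared collector, compacting only the second sstable would start from an estimate of 5 keys instead of 2.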
As a result, when such sstables are compacted, the estimated number of keys in
the resulting sstables, which is used to determine the size of the bloom filter
for the compaction result, is far too high.
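To see why the key estimate matters, here is a rough sketch based on the textbook bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2. It is an assumption for illustration only; Cassandra's own sizing code may compute this differently. The class and method names are hypothetical.

```java
public class BloomSizingSketch {
    // Textbook optimal bit count for n expected keys and target
    // false-positive chance p: -n * ln(p) / (ln 2)^2.
    static long optimalBits(long expectedKeys, double fpChance) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedKeys * Math.log(fpChance) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        long actualKeys = 1_000_000L;
        long overestimatedKeys = 2 * actualKeys; // e.g. two sstables, each counted twice
        double fpChance = 0.01;
        System.out.println("bits for actual keys:        " + optimalBits(actualKeys, fpChance));
        System.out.println("bits for overestimated keys: " + optimalBits(overestimatedKeys, fpChance));
    }
}
```

Under this formula the filter size is linear in the expected key count, so an estimate that is k times too high yields a filter roughly k times larger than needed.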
This makes L1 bloom filters much bigger than they should be. One example
(which came up during testing of the upcoming CEP-26, after inserting
100GB of data with 10% reads):
(current)
{code}
Bloom filter false positives: 22627369
Bloom filter false ratio: 0.02257
Bloom filter space used: 1848247864
Bloom filter off heap memory used: 2338964088
{code}
(fixed)
{code}
Bloom filter false positives: 24426545
Bloom filter false ratio: 0.02429
Bloom filter space used: 1118910096
Bloom filter off heap memory used: 1532357432
{code}
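The relative change can be worked out from the figures above. The helper below is only arithmetic on the reported numbers; the class and method names are made up for this sketch.

```java
public class BloomSavingsSketch {
    // Fraction by which a value shrank going from `before` to `after`.
    static double reduction(long before, long after) {
        return 1.0 - (double) after / (double) before;
    }

    public static void main(String[] args) {
        // Figures copied from the (current) and (fixed) outputs above.
        double space = reduction(1848247864L, 1118910096L);
        double offHeap = reduction(2338964088L, 1532357432L);
        System.out.printf("Bloom filter space used: %.1f%% smaller%n", space * 100);
        System.out.printf("Off heap memory used:    %.1f%% smaller%n", offHeap * 100);
    }
}
```

This comes to roughly 39% less bloom filter space and roughly 34% less off-heap memory, while the false ratio moves from 0.02257 to 0.02429.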