[
https://issues.apache.org/jira/browse/CASSANDRA-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Stupp updated CASSANDRA-9150:
------------------------------------
Description:
As part of CASSANDRA-8413 I wrote a test that checks bloom filter
false-positive-ratio and compares it to the target false-positive-chance.
[The
test|https://github.com/snazy/cassandra/blob/8413-bffp/test/unit/org/apache/cassandra/utils/BloomFilterTest.java#L359]
basically creates a BF using {{FilterFactory.getFilter}} with the number of
elements to be added and the FPC. It then adds that number of elements to the
filter. It checks the false-positive-ratio by calling {{IFilter.isPresent}}
using not-added keys.
It feels that the FPR increases linearly with the number of elements and
exceeds the FPC for filters with more than 1,000,000 elements. Filters with
100M elements have an FPR of .94, which is probably bad.
{code}
invertedHash=false fpc=0,010000 fpr=0,000107 for Int32Type with 10000
elements, bitset capacity= 100032, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,000931 for Int32Type with 100000
elements, bitset capacity= 1000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,009405 for Int32Type with 1000000
elements, bitset capacity= 10000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,093752 for Int32Type with 10000000
elements, bitset capacity= 100000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,942765 for Int32Type with 100000000
elements, bitset capacity=1000000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
{code}
Do I measure something wrong here?
EDIT: it does not depend on the type or the FPC - it's the same pattern for
other types and other FPCs.
was:
As part of CASSANDRA-8413 I wrote a test that checks bloom filter
false-positive-ratio and compares it to the target false-positive-chance.
[The
test|https://github.com/snazy/cassandra/blob/8413-bffp/test/unit/org/apache/cassandra/utils/BloomFilterTest.java#L359]
basically creates a BF using {{FilterFactory.getFilter}} with the number of
elements to be added and the FPC. It then adds that number of elements to the
filter. It checks the false-positive-ratio by calling {{IFilter.isPresent}}
using not-added keys.
It feels that the FPR increases linearly with the number of elements and
exceeds the FPC for filters with more than 1,000,000 elements. Filters with
100M elements have an FPR of .94, which is probably bad.
{code}
invertedHash=false fpc=0,010000 fpr=0,000107 for Int32Type with 10000
elements, bitset capacity= 100032, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,000931 for Int32Type with 100000
elements, bitset capacity= 1000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,009405 for Int32Type with 1000000
elements, bitset capacity= 10000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,093752 for Int32Type with 10000000
elements, bitset capacity= 100000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
invertedHash=false fpc=0,010000 fpr=0,942765 for Int32Type with 100000000
elements, bitset capacity=1000000064, spec=BloomSpecification(K=5,
bucketsPerElement=10)
{code}
Do I measure something wrong here?
> BloomFilter false-positive-ratio does not match false-positive-chance
> ---------------------------------------------------------------------
>
> Key: CASSANDRA-9150
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9150
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Robert Stupp
>
> As part of CASSANDRA-8413 I wrote a test that checks bloom filter
> false-positive-ratio and compares it to the target false-positive-chance.
> [The
> test|https://github.com/snazy/cassandra/blob/8413-bffp/test/unit/org/apache/cassandra/utils/BloomFilterTest.java#L359]
> basically creates a BF using {{FilterFactory.getFilter}} with the number of
> elements to be added and the FPC. It then adds that number of elements to the
> filter. It checks the false-positive-ratio by calling {{IFilter.isPresent}}
> using not-added keys.
> It feels that the FPR increases linearly with the number of elements and
> exceeds the FPC for filters with more than 1,000,000 elements. Filters with
> 100M elements have an FPR of .94, which is probably bad.
> {code}
> invertedHash=false fpc=0,010000 fpr=0,000107 for Int32Type with 10000
> elements, bitset capacity= 100032, spec=BloomSpecification(K=5,
> bucketsPerElement=10)
> invertedHash=false fpc=0,010000 fpr=0,000931 for Int32Type with 100000
> elements, bitset capacity= 1000064, spec=BloomSpecification(K=5,
> bucketsPerElement=10)
> invertedHash=false fpc=0,010000 fpr=0,009405 for Int32Type with 1000000
> elements, bitset capacity= 10000064, spec=BloomSpecification(K=5,
> bucketsPerElement=10)
> invertedHash=false fpc=0,010000 fpr=0,093752 for Int32Type with 10000000
> elements, bitset capacity= 100000064, spec=BloomSpecification(K=5,
> bucketsPerElement=10)
> invertedHash=false fpc=0,010000 fpr=0,942765 for Int32Type with 100000000
> elements, bitset capacity=1000000064, spec=BloomSpecification(K=5,
> bucketsPerElement=10)
> {code}
> Do I measure something wrong here?
> EDIT: it does not depend on the type or the FPC - it's the same pattern for
> other types and other FPCs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)