Ariel Weisberg commented on CASSANDRA-9167:

I don't want to derail this, but the way most bloom filter libraries work there 
is an efficient curve they try to fit where they they pick the maximum number 
of hash functions to efficiently use the bits necessary to achieve a specific 
FPP. IOW they always optimize for space efficiency over # of hash functions.

As an alternative you could increase the number of bits in the filter and 
reduce the # of hash functions to trade space for speed and meet the desired 
FPP. However most BF libraries I have seen don't allow you to do this. FPP is 
maybe not the thing that should be negotiable? Seems like FPP and space vs 
speed are two separate concerns since desired FPP is a function of the # of 
bloom filters you intend to check. I think LSMs are a bit of a special case 
because you intend to check several bloom filters and not just one and you 
desire an aggregate FPP across several bloom filter checks.

When you look at the # of hash functions in practice it can be close to the 
cost in terms of # of cache and TLB misses to walking a relatively small tree. 
I've played with this a bit microbenchmarking just bloom filters and the 
tradeoff has real impact especially if you are using something like size tiered 
and will need to check many sstables.

> Improve bloom-filter false-positive-ratio
> -----------------------------------------
>                 Key: CASSANDRA-9167
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9167
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Robert Stupp
>            Assignee: Robert Stupp
>            Priority: Minor
>              Labels: perfomance
> {{org.apache.cassandra.utils.BloomCalculations}} performs some table lookups 
> to calculate the bloom filter specification (size, # of hashes). Using the 
> exact maths for that computation brings a better false-positive-ratio (the 
> maths usually returns higher numbers for hash-counts).
> TL;DR increasing the number of hash-rounds brings a nice improvement. Finally 
> it's a trade-off between CPU and I/O.
> ||false-positive-chance||elements||capacity||hash count 
> new||false-positive-ratio new||hash count current||false-positive-ratio 
> current||improvement
> |0.1|10000|50048|3|0.0848|3|0.0848|0
> |0.1|100000|500032|3|0.09203|3|0.09203|0
> |0.1|1000000|5000064|3|0.0919|3|0.0919|0
> |0.1|10000000|50000064|3|0.09182|3|0.09182|0
> |0.1|100000000|500000064|3|0.091874|3|0.091874|0
> |0.01|10000|100032|7|0.0092|5|0.0107|0.1630434783
> |0.01|100000|1000064|7|0.00818|5|0.00931|0.1381418093
> |0.01|1000000|10000064|7|0.008072|5|0.009405|0.1651387512
> |0.01|10000000|100000064|7|0.008174|5|0.009375|0.146929288
> |0.01|100000000|1000000064|7|0.008197|5|0.009428|0.150176894
> |0.001|10000|150080|10|0.0008|7|0.001|0.25
> |0.001|100000|1500032|10|0.0006|7|0.00094|0.5666666667
> |0.001|1000000|15000064|10|0.000717|7|0.000991|0.3821478382
> |0.001|10000000|150000064|10|0.000743|7|0.000992|0.33512786
> |0.001|100000000|1500000064|10|0.000741|7|0.001002|0.3522267206
> |0.0001|10000|200064|13|0|10|0.0002|#DIV/0!
> |0.0001|100000|2000064|13|0.00004|10|0.0001|1.5
> |0.0001|1000000|20000064|13|0.000075|10|0.000091|0.2133333333
> |0.0001|10000000|200000064|13|0.000069|10|0.000087|0.2608695652
> |0.0001|100000000|2000000064|13|0.000068|10|0.00009|0.3235294118
> If we decide to allow more hash-rounds, it could be nicely back-ported even 
> to 2.0 without affecting existing sstables.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to