[ 
https://issues.apache.org/jira/browse/CASSANDRA-9167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128161#comment-15128161
 ] 

Benedict commented on CASSANDRA-9167:
-------------------------------------

[~snazy]: sorry for completely letting this fester.  It looks good to me, the 
only question is if we _want_ to trade CPU and memory-indirection costs.  In 
principle we do, since this is to prevent disk accesses.  However for many of 
our benchmark in-memory workloads this is probably only increasing our costs.  
However the cost increase is likely minimal compared to our other costs, so on 
balance with the present state of affairs I'd say this is a pretty darn 
positive change.

That said, I think it would be neater if we could permit a degree of 
configurability, where we weigh the marginal cost increase versus the marginal 
false positive gain for any added hash function, and provide users the option 
of configuring the tradeoff between the two.

That also said, it's perhaps overthinking things for now, and this positive if 
unglamorous change should have been included a long time ago IMO.

Assuming it goes in roughly as-is, a few nits/suggestions:

# Add a 'd' suffix to the literals
# Re-calculate the buckets per-element after rounding the hash count, instead 
of adding a fudge-factor (after all, half the time that fudge factor is wasting 
space, the other half it may be insufficient)?

> Improve bloom-filter false-positive-ratio
> -----------------------------------------
>
>                 Key: CASSANDRA-9167
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9167
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Robert Stupp
>            Priority: Minor
>              Labels: perfomance
>
> {{org.apache.cassandra.utils.BloomCalculations}} performs some table lookups 
> to calculate the bloom filter specification (size, # of hashes). Using the 
> exact maths for that computation brings a better false-positive-ratio (the 
> maths usually returns higher numbers for hash-counts).
> TL;DR increasing the number of hash-rounds brings a nice improvement. Finally 
> it's a trade-off between CPU and I/O.
> ||false-positive-chance||elements||capacity||hash count 
> new||false-positive-ratio new||hash count current||false-positive-ratio 
> current||improvement
> |0.1|10000|50048|3|0.0848|3|0.0848|0
> |0.1|100000|500032|3|0.09203|3|0.09203|0
> |0.1|1000000|5000064|3|0.0919|3|0.0919|0
> |0.1|10000000|50000064|3|0.09182|3|0.09182|0
> |0.1|100000000|500000064|3|0.091874|3|0.091874|0
> |0.01|10000|100032|7|0.0092|5|0.0107|0.1630434783
> |0.01|100000|1000064|7|0.00818|5|0.00931|0.1381418093
> |0.01|1000000|10000064|7|0.008072|5|0.009405|0.1651387512
> |0.01|10000000|100000064|7|0.008174|5|0.009375|0.146929288
> |0.01|100000000|1000000064|7|0.008197|5|0.009428|0.150176894
> |0.001|10000|150080|10|0.0008|7|0.001|0.25
> |0.001|100000|1500032|10|0.0006|7|0.00094|0.5666666667
> |0.001|1000000|15000064|10|0.000717|7|0.000991|0.3821478382
> |0.001|10000000|150000064|10|0.000743|7|0.000992|0.33512786
> |0.001|100000000|1500000064|10|0.000741|7|0.001002|0.3522267206
> |0.0001|10000|200064|13|0|10|0.0002|#DIV/0!
> |0.0001|100000|2000064|13|0.00004|10|0.0001|1.5
> |0.0001|1000000|20000064|13|0.000075|10|0.000091|0.2133333333
> |0.0001|10000000|200000064|13|0.000069|10|0.000087|0.2608695652
> |0.0001|100000000|2000000064|13|0.000068|10|0.00009|0.3235294118
> If we decide to allow more hash-rounds, it could be nicely back-ported even 
> to 2.0 without affecting existing sstables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to