[
https://issues.apache.org/jira/browse/CASSANDRA-9167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128161#comment-15128161
]
Benedict commented on CASSANDRA-9167:
-------------------------------------
[~snazy]: sorry for completely letting this fester. It looks good to me, the
only question is if we _want_ to trade CPU and memory-indirection costs. In
principle we do, since this is to prevent disk accesses. However for many of
our benchmark in-memory workloads this is probably only increasing our costs.
However the cost increase is likely minimal compared to our other costs, so on
balance with the present state of affairs I'd say this is a pretty darn
positive change.
That said, I think it would be neater if we could permit a degree of
configurability, where we weigh the marginal cost increase versus the marginal
false positive gain for any added hash function, and provide users the option
of configuring the tradeoff between the two.
That also said, it's perhaps overthinking things for now, and this positive if
unglamorous change should have been included a long time ago IMO.
Assuming it goes in roughly as-is, a few nits/suggestions:
# Add a 'd' suffix to the literals
# Re-calculate the buckets per-element after rounding the hash count, instead
of adding a fudge-factor (after all, half the time that fudge factor is wasting
space, the other half it may be insufficient)?
> Improve bloom-filter false-positive-ratio
> -----------------------------------------
>
> Key: CASSANDRA-9167
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9167
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Robert Stupp
> Assignee: Robert Stupp
> Priority: Minor
> Labels: perfomance
>
> {{org.apache.cassandra.utils.BloomCalculations}} performs some table lookups
> to calculate the bloom filter specification (size, # of hashes). Using the
> exact maths for that computation brings a better false-positive-ratio (the
> maths usually returns higher numbers for hash-counts).
> TL;DR increasing the number of hash-rounds brings a nice improvement. Finally
> it's a trade-off between CPU and I/O.
> ||false-positive-chance||elements||capacity||hash count
> new||false-positive-ratio new||hash count current||false-positive-ratio
> current||improvement
> |0.1|10000|50048|3|0.0848|3|0.0848|0
> |0.1|100000|500032|3|0.09203|3|0.09203|0
> |0.1|1000000|5000064|3|0.0919|3|0.0919|0
> |0.1|10000000|50000064|3|0.09182|3|0.09182|0
> |0.1|100000000|500000064|3|0.091874|3|0.091874|0
> |0.01|10000|100032|7|0.0092|5|0.0107|0.1630434783
> |0.01|100000|1000064|7|0.00818|5|0.00931|0.1381418093
> |0.01|1000000|10000064|7|0.008072|5|0.009405|0.1651387512
> |0.01|10000000|100000064|7|0.008174|5|0.009375|0.146929288
> |0.01|100000000|1000000064|7|0.008197|5|0.009428|0.150176894
> |0.001|10000|150080|10|0.0008|7|0.001|0.25
> |0.001|100000|1500032|10|0.0006|7|0.00094|0.5666666667
> |0.001|1000000|15000064|10|0.000717|7|0.000991|0.3821478382
> |0.001|10000000|150000064|10|0.000743|7|0.000992|0.33512786
> |0.001|100000000|1500000064|10|0.000741|7|0.001002|0.3522267206
> |0.0001|10000|200064|13|0|10|0.0002|#DIV/0!
> |0.0001|100000|2000064|13|0.00004|10|0.0001|1.5
> |0.0001|1000000|20000064|13|0.000075|10|0.000091|0.2133333333
> |0.0001|10000000|200000064|13|0.000069|10|0.000087|0.2608695652
> |0.0001|100000000|2000000064|13|0.000068|10|0.00009|0.3235294118
> If we decide to allow more hash-rounds, it could be nicely back-ported even
> to 2.0 without affecting existing sstables.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)