Super cool!

Yeah, I find the intuition on ideal hash count to be tricky as well.
Increasing the number of hash functions increases saturation (not
desirable), but, at a given saturation, it also decreases false positives
(desirable), since `fpp = s**k` (where s is saturation, i.e. the fraction
of bits set, and k is the number of hash functions). Since the ideal
saturation is 0.5, fpp drops exponentially as k increases (assuming s
holds at 0.5).
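For example, at `s == 0.5` each additional hash function halves fpp:
`0.5**7 ~= 0.008` and `0.5**10 ~= 0.001`.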

But k also drives saturation up (`s = 1 - e**(-kn/m)`). It turns out the
ideal k value is `k = (m/n) * ln(2)`, which yields a saturation of 0.5.
All this means you can first pick an ideal size (m) from the expected
number of unique values (n) and a target fpp, by substituting
`(m/n) * ln(2)` for k in `fpp = 0.5**k` and solving for m, then compute k
from the chosen value of m. This is exactly what the FuzzySet
implementation does (as of GH#11900).
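
For concreteness, here's a minimal sketch of that sizing math (standalone
Java, not the actual FuzzySet API; class and method names are mine).
Substituting `(m/n) * ln(2)` for k in `fpp = 0.5**k` and solving gives
`m = -n * ln(fpp) / ln(2)**2`, and then k follows from m:

    // Minimal sketch of standard Bloom filter sizing math; hypothetical
    // helper, not the actual FuzzySet code.
    public final class BloomSizing {
      // Bits needed for n expected entries at target false positive rate
      // p, from fpp = 0.5^k with k = (m/n) * ln(2).
      static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
      }

      // Hash function count that yields ~0.5 saturation for the chosen m.
      static int optimalK(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
      }

      public static void main(String[] args) {
        long n = 500_000_000L; // expected unique values
        double p = 0.1;        // target false positive probability
        long m = optimalBits(n, p);
        int k = optimalK(m, n);
        // Prints ~4.79 bits/element and k = 3 for these inputs.
        System.out.printf("m = %d bits (%.2f bits/element), k = %d%n",
            m, (double) m / n, k);
      }
    }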

But on the intuition bit, this is what helped me: if you crank k way up,
you drive saturation way up, and no number of hash functions will really
help you once saturation is super high. In other words, since
`fpp = s**k`, it takes a really large k to get that fpp down (e.g., if
s == 0.9, you'd need k ~= 22 to get fpp ~= 0.1), but cranking k up just
further increases saturation.
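
Here's a throwaway sketch (my own code, using the `s = 1 - e**(-kn/m)`
approximation from above) that shows this feedback loop numerically: fix
the bits-per-element ratio m/n and sweep k, and fpp bottoms out right
around `k = (m/n) * ln(2)` before saturation takes over and it climbs
back up:

    // Throwaway sketch: for a fixed m/n, sweep k and watch fpp = s^k hit
    // its minimum near k = (m/n) * ln(2) (~7 for 10 bits/element), then
    // worsen as saturation climbs.
    public final class SweepK {
      public static void main(String[] args) {
        double bitsPerElement = 10.0; // m/n
        for (int k = 1; k <= 15; k++) {
          double s = 1 - Math.exp(-k / bitsPerElement); // saturation
          double fpp = Math.pow(s, k);                  // false positive rate
          System.out.printf("k=%2d  saturation=%.3f  fpp=%.6f%n", k, s, fpp);
        }
      }
    }

At 10 bits/element the minimum lands at k = 7 (saturation ~0.5, fpp
~0.8%), with fpp noticeably worse on either side.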

I found this calculator (
https://hur.st/bloomfilter/?n=500000000&p=.1&m=&k=) useful for visualizing
the interaction between these variables (in particular, fpp vs. k).

All this also makes me wonder why we make "target saturation" in FuzzySet
configurable (
https://github.com/apache/lucene/blob/a55be5cb368cba698114f63bc8e8be2a2a55b089/lucene/codecs/src/java/org/apache/lucene/codecs/bloom/FuzzySet.java#L298).
Since 0.5 is ideal, it's unclear to me why anyone would want to set a
different target for downsizing. Maybe this sort of carries over from cases
where a single hash function is used (we still have methods that set up
FuzzySet this way)? I think there's an opportunity to clean up the API a
bit and get rid of the ability to create FuzzySet with a hardcoded k == 1
(shameless plug for a PR I opened back in May to do this: GH#14615).

Cheers,
-g

On Sat, Dec 20, 2025 at 1:50 AM Michael Wechner <[email protected]>
wrote:

> very cool, thanks for sharing!
> On 19.12.25 at 18:13, Michael McCandless wrote:
>
> Hi Team, I thought this might be interesting to some of us :)
>
> If anyone out there also struggles to understand / intuit how a bloom
> filter's multiple hash functions (which Lucene's bloom filter implements)
> improve the false positive error rate, here is an interactive 3D
> surface/chart to see how hash function count (k), bits per element (m/n),
> and false positive error rate relate:
>
>     https://githubsearch.mikemccandless.com/gemini_bloom10.html
>
> The mouseover also shows the saturation (percent bits set) but it's not
> rendered in the graph.
>
> I created this by iterating with Gemini Pro 3 ... see the full
> prompting/iterations here: https://gemini.google.com/share/46a219e0abaf
>
> Net/net doing multiple hash functions into the same backing bit set is
> surprisingly worthwhile even as the saturation grows, up to 50%.
>
> It's entirely possible Gemini hallucinated something in the graph!!  But
> it seems sorta correct to me.  And of course that false positive error rate
> formula involves e!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
