On Fri, 4 Jul 2025 at 17:43, Ross Heaney <rheane...@gmail.com> wrote:
> The witness does not store information for every element in the entire
> universe. Instead, it only needs to store a bit for each element that
> passes the Bloom filter test.
I think this has a major flaw, in that it is nearly impossible to guarantee zero false positives in practice using the scheme you describe.

Assuming 1 bit for each element that passes the main Bloom filter test, the witness list needs to be sized to O(universe * fp_rate) to accommodate a bit for every passed element (with some margin in either direction to account for variability in the practical Bloom filter false positive rate). When you store 32-bit integers, this would be ~2^22 bits for a false positive rate of 1/1024. Storing 64-bit integers at the same fp_rate would require ~2^54 bits in this witness list. The Bloom filter itself only contains N elements, and its size is proportional to -log(fp_rate) * N, which for any complex data type is orders of magnitude smaller than the witness list. The Bloom filter is orders of magnitude smaller because the witness list only works when we can assume there is no pigeonhole issue with the values stored, i.e. when we only look at small integer types, or at hashes where we don't mind hash collisions and the related false positives.

It is quite possible (guaranteed, even!) that two different strings hash to exactly the same values when using a fixed-size hash output, like the hash methods used in PostgreSQL. A witness list that uses your described method to guarantee no false positives even for strings would have to be sized on the order of 2^(8 * 2^30) bits, give or take a few zeroes, to account for all possible string values in PostgreSQL. That just isn't practical.

Actually, after looking at your 0001 patch, it looks like your 'witness' is not a list of both true and false positives, but a hashmap of only the true positive values, i.e. the list of inserted values. That does indeed make the combined data structure mostly "lossless", but:

1) this adds O(n_inserted) overhead with a large constant multiplier on n_inserted, as hash maps have about 24 B / 192 bits of overhead per entry, which is probably an order of magnitude more than the approximately O(-log(p)) bits per element of the Bloom filter itself; and
2) this won't solve the pigeonhole issue with trying to use a Bloom filter on strings or other data types that might have more than the kept 64 bits of entropy.

So, I'm sorry, but I don't think this is a lossless Bloom filter.

Kind regards,

Matthias van de Meent
Databricks
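
PS: For what it's worth, here is a rough back-of-envelope sketch of the sizes I used above. The parameters (p = 1/1024, n = 10^6 inserted elements) are assumptions I picked for illustration, not numbers taken from your patch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 1.0 / 1024;   /* assumed false positive rate */
    double n = 1e6;          /* assumed number of inserted elements */

    /*
     * A witness bitmap with one bit for every value that could pass the
     * filter has to cover roughly |universe| * p bits.
     */
    double witness32 = pow(2, 32) * p;   /* ~2^22 bits for a 32-bit type */
    double witness64 = pow(2, 64) * p;   /* ~2^54 bits for a 64-bit type */

    /* A well-sized Bloom filter needs about 1.44 * log2(1/p) bits/element. */
    double bloom_per_elem = 1.44 * log2(1.0 / p);

    printf("witness bitmap, 32-bit universe: 2^%.0f bits\n", log2(witness32));
    printf("witness bitmap, 64-bit universe: 2^%.0f bits\n", log2(witness64));
    printf("bloom filter: %.1f bits/element, 2^%.1f bits total for n = %.0f\n",
           bloom_per_elem, log2(bloom_per_elem * n), n);
    return 0;
}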
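
With those assumed numbers the filter costs ~14.4 bits per inserted element, while a hash map of the true positives costs on the order of 192 bits of overhead per entry before counting the keys themselves, which is where the order-of-magnitude difference in point 1) comes from.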