Hi, I appreciate you taking the time to leave feedback. I would like to
address your points below.
> I think this has a major flaw, in that it is nearly impossible to
> guarantee zero false positives in practice using the scheme you
> describe. Assuming 1 bit for each element that passes the main Bloom
> filter test, the witness list needs to be sized to O(universe *
> fp_rate) to accommodate a bit for all passed elements (with some
> margin to either direction to account for variability in the
> practical Bloom filter false positive rate).
> When you store 32-bit integers, this would be ~2^22 bits for a false
> positive rate of 1/1024. Storing 64-bit integers at the same fp_rate
> would require ~2^54 bits in this witness list. Assuming the bloom
> filter itself contains only N elements, and the size of a bloom
> filter is proportional to -log(fp_rate) * N, which for any complex
> data type is orders of magnitude smaller than the witness list.

You've highlighted an important point. I was quite lazy in how I
defined the universe of possible values. I defined it to be all
possible 64-bit values, so your calculation is correct. I have been
primarily using this scheme to compress video, where the universe of
possible values is much, much smaller, and I did not fully think
through the implications of allowing the universe of possible values
to be that large. After thinking about this a bit more, I don't think
it is practical at all to use this scheme in PostgreSQL.

I still think the ideas from the paper
<https://arxiv.org/abs/2502.02193> around using a rational number of
hash functions are worthwhile, and I would be happy to provide a patch
for this (a rough sketch of what I have in mind follows after the
quoted message below).

Kind regards,
Ross

On Fri, Jul 4, 2025 at 6:26 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Fri, 4 Jul 2025 at 17:43, Ross Heaney <rheane...@gmail.com> wrote:
> > The witness does not store information for every element in the
> > entire universe. Instead, it only needs to store a bit for each
> > element that passes the Bloom filter test.
>
> I think this has a major flaw, in that it is nearly impossible to
> guarantee zero false positives in practice using the scheme you
> describe. Assuming 1 bit for each element that passes the main Bloom
> filter test, the witness list needs to be sized to O(universe *
> fp_rate) to accommodate a bit for all passed elements (with some
> margin to either direction to account for variability in the
> practical Bloom filter false positive rate).
> When you store 32-bit integers, this would be ~2^22 bits for a false
> positive rate of 1/1024. Storing 64-bit integers at the same fp_rate
> would require ~2^54 bits in this witness list. Assuming the bloom
> filter itself contains only N elements, and the size of a bloom
> filter is proportional to -log(fp_rate) * N, which for any complex
> data type is orders of magnitude smaller than the witness list.
>
> The bloom filter is orders of magnitude smaller, because this witness
> list only works when we assume that there is no pigeonhole issue
> with the values stored, that we're only looking for small integer
> types, or hashes and don't mind hash collisions and the related false
> positives.
> It is quite possible (guaranteed even!) that two different strings
> hash to exactly the same values when using a fixed-size hash output,
> like the hash methods used in PostgreSQL. A witness list that uses
> your described method to guarantee no false positives even for
> strings would have to be sized to the order of 2^(8 * 2^30), give or
> take a few zeroes, to account for all possible string values in
> PostgreSQL. That just isn't practical.
>
> Actually, after looking at your 0001 patch, it looks like your
> 'witness' is not a list of both true and false positives, but a
> hashmap of only the true positive values, i.e. the list of inserted
> values. That does indeed make the combined data structure mostly
> "lossless", but
> 1.) this adds O(n_inserted) overhead with a large constant multiplier
> on n_inserted, as hash maps have about 24 B/192 bits of overhead per
> entry, which is probably an order of magnitude more than the
> approximately O(-log(p)) of the bloom filter itself; and
> 2.) this won't solve the pigeonhole issue with trying to use a bloom
> filter on strings or other data types that might have more than the
> kept 64 bits of entropy.
>
> So, I'm sorry, but I don't think this is a lossless Bloom filter.
>
> Kind regards,
>
> Matthias van de Meent
> Databricks
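For completeness, here is the sizing arithmetic from the quoted message
spelled out. Nothing new, just the same numbers restated in a few lines
of Python; the Bloom filter line uses the textbook optimal sizing
formula m = -n * ln(p) / (ln 2)^2, i.e. the -log(fp_rate) * N
proportionality mentioned above.

import math

def witness_bits(universe_bits: int, fp_rate: float) -> float:
    # One witness bit per value that could pass the filter as a false
    # positive: |universe| * fp_rate entries on average.
    return 2.0 ** universe_bits * fp_rate

def bloom_filter_bits(n_elements: int, fp_rate: float) -> float:
    # Textbook optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2.
    return -n_elements * math.log(fp_rate) / math.log(2) ** 2

print(math.log2(witness_bits(32, 1 / 1024)))   # 22.0  -> ~2^22 bits
print(math.log2(witness_bits(64, 1 / 1024)))   # 54.0  -> ~2^54 bits
print(bloom_filter_bits(1_000_000, 1 / 1024))  # ~1.44e7 bits (~1.8 MB)

So once the universe is 64 bits wide the witness dominates the filter
by many orders of magnitude, which is the point I had missed.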
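And here is the rough sketch I mentioned for the rational number of
hash functions. To be clear, this is only an illustration of one way a
non-integer k can be realised (each key deterministically gets either
floor(k) or ceil(k) probes, so the average works out to k); the class
and method names are made up for the example, and an actual patch would
follow the construction in the paper rather than this toy.

import hashlib
import math

class RationalKBloomFilter:
    """Toy Bloom filter whose average number of probes per key is a
    rational k (e.g. 3.7): a deterministic fraction of keys gets
    ceil(k) probes, the rest floor(k). Illustration only."""

    def __init__(self, m_bits: int, k: float):
        self.m = m_bits
        self.k_floor = math.floor(k)
        self.extra_prob = k - self.k_floor  # fraction of keys given one extra probe
        self.bits = bytearray((m_bits + 7) // 8)

    def _probes(self, item: bytes):
        # Derive all probe positions from one wide hash of the key.
        h = int.from_bytes(hashlib.blake2b(item, digest_size=32).digest(), "big")
        # The low 16 bits decide, per key, whether the fractional probe applies.
        extra = (h & 0xFFFF) / 0x10000 < self.extra_prob
        h >>= 16
        for _ in range(self.k_floor + (1 if extra else 0)):
            yield h % self.m
            h //= self.m

    def add(self, item: bytes) -> None:
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._probes(item))

# e.g. f = RationalKBloomFilter(m_bits=1 << 20, k=6.5); f.add(b"hello")

The appeal, as I understand it, is that k no longer has to be rounded
to an integer, so the filter can sit closer to the optimum for a given
bits-per-element budget.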