Hi, I appreciate you taking the time to leave feedback. I would like to
address your points below.
> I think this has a major flaw, in that it is nearly impossible to
> guarantee zero false positives in practice using the scheme you
> describe. Assuming 1 bit for each element that passes the main Bloom
> filter test, the witness list needs to be sized to O(universe *
> fp_rate) to accommodate a bit for all passed elements (with some
> margin to either direction to account for variability in the
> practical Bloom filter false positive rate).
> When you store 32-bit integers, this would be ~2^22 bits for a false
> positive rate of 1/1024. Storing 64-bit integers at the same fp_rate
> would require ~2^54 bits in this witness list. Assuming the bloom
> filter itself contains only N elements, and the size of a bloom
> filter is proportional to -log(fp_rate) * N, which for any complex
> data type is orders of magnitude smaller than the witness list.

You've highlighted an important point. I was quite lazy in how I
defined the universe of possible values. I defined it to be all
possible 64-bit values, so your calculation is correct. I have been
primarily using this scheme to compress video, where the universe of
possible values is much, much smaller, and I did not fully think
through the implications of allowing the universe of possible values
to be that large. After thinking about this a bit more, I don't think
it is practical at all to use this scheme in PostgreSQL.

I still think the ideas from the paper
<https://arxiv.org/abs/2502.02193> around using a rational number of
hash functions are worthwhile, and I would be happy to provide a patch
for this (a rough sketch of what I have in mind follows after the
quoted message below).

Kind regards,
Ross

On Fri, Jul 4, 2025 at 6:26 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Fri, 4 Jul 2025 at 17:43, Ross Heaney <rheane...@gmail.com> wrote:
> > The witness does not store information for every element in the
> > entire universe. Instead, it only needs to store a bit for each
> > element that passes the Bloom filter test.
>
> I think this has a major flaw, in that it is nearly impossible to
> guarantee zero false positives in practice using the scheme you
> describe. Assuming 1 bit for each element that passes the main Bloom
> filter test, the witness list needs to be sized to O(universe *
> fp_rate) to accommodate a bit for all passed elements (with some
> margin to either direction to account for variability in the
> practical Bloom filter false positive rate).
> When you store 32-bit integers, this would be ~2^22 bits for a false
> positive rate of 1/1024. Storing 64-bit integers at the same fp_rate
> would require ~2^54 bits in this witness list. Assuming the bloom
> filter itself contains only N elements, and the size of a bloom
> filter is proportional to -log(fp_rate) * N, which for any complex
> data type is orders of magnitude smaller than the witness list.
>
> The bloom filter is orders of magnitude smaller, because this witness
> list only works when we assume that there is no pigeonhole issue
> with the values stored, that we're only looking for small integer
> types, or hashes and don't mind hash collisions and the related false
> positives.
> It is quite possible (guaranteed even!) that two different strings
> hash to exactly the same values when using a fixed-size hash output,
> like the hash methods used in PostgreSQL. A witness list that uses
> your described method to guarantee no false positives even for
> strings would have to be sized to the order of 2^(8 * 2^30), give or
> take a few zeroes, to account for all possible string values in
> PostgreSQL. That just isn't practical.
>
> Actually, after looking at your 0001 patch, it looks like your
> 'witness' is not a list of both true and false positives, but a
> hashmap of only the true positive values, i.e. the list of inserted
> values. That does indeed make the combined data structure mostly
> "lossless", but
> 1.) this adds O(n_inserted) overhead with a large constant multiplier
> on n_inserted, as hash maps have about 24 B/192 bits of overhead per
> entry, which is probably an order of magnitude more than the
> approximately O(-log(p)) of the bloom filter itself; and
> 2.) this won't solve the pigeonhole issue with trying to use a bloom
> filter on strings or other data types that might have more than the
> kept 64 bits of entropy.
>
> So, I'm sorry, but I don't think this is a lossless Bloom filter.
>
> Kind regards,
>
> Matthias van de Meent
> Databricks
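For completeness, here is the sizing arithmetic from the quoted message
spelled out. Nothing new, just the same numbers restated in a few lines
of Python; the Bloom filter line uses the textbook optimal sizing
formula m = -n * ln(p) / (ln 2)^2, i.e. the -log(fp_rate) * N
proportionality mentioned above.

import math

def witness_bits(universe_bits: int, fp_rate: float) -> float:
    # One witness bit per value that could pass the filter as a false
    # positive: |universe| * fp_rate entries on average.
    return 2.0 ** universe_bits * fp_rate

def bloom_filter_bits(n_elements: int, fp_rate: float) -> float:
    # Textbook optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2.
    return -n_elements * math.log(fp_rate) / math.log(2) ** 2

print(math.log2(witness_bits(32, 1 / 1024)))   # 22.0  -> ~2^22 bits
print(math.log2(witness_bits(64, 1 / 1024)))   # 54.0  -> ~2^54 bits
print(bloom_filter_bits(1_000_000, 1 / 1024))  # ~1.44e7 bits (~1.8 MB)

So once the universe is 64 bits wide the witness dominates the filter
by many orders of magnitude, which is the point I had missed.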
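And here is the rough sketch I mentioned for the rational number of
hash functions. To be clear, this is only an illustration of one way a
non-integer k can be realised (each key deterministically gets either
floor(k) or ceil(k) probes, so the average works out to k); the class
and method names are made up for the example, and an actual patch would
follow the construction in the paper rather than this toy.

import hashlib
import math

class RationalKBloomFilter:
    """Toy Bloom filter whose average number of probes per key is a
    rational k (e.g. 3.7): a deterministic fraction of keys gets
    ceil(k) probes, the rest floor(k). Illustration only."""

    def __init__(self, m_bits: int, k: float):
        self.m = m_bits
        self.k_floor = math.floor(k)
        self.extra_prob = k - self.k_floor  # fraction of keys given one extra probe
        self.bits = bytearray((m_bits + 7) // 8)

    def _probes(self, item: bytes):
        # Derive all probe positions from one wide hash of the key.
        h = int.from_bytes(hashlib.blake2b(item, digest_size=32).digest(), "big")
        # The low 16 bits decide, per key, whether the fractional probe applies.
        extra = (h & 0xFFFF) / 0x10000 < self.extra_prob
        h >>= 16
        for _ in range(self.k_floor + (1 if extra else 0)):
            yield h % self.m
            h //= self.m

    def add(self, item: bytes) -> None:
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._probes(item))

# e.g. f = RationalKBloomFilter(m_bits=1 << 20, k=6.5); f.add(b"hello")

The appeal, as I understand it, is that k no longer has to be rounded
to an integer, so the filter can sit closer to the optimum for a given
bits-per-element budget.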