On Fri, 4 Jul 2025 at 17:43, Ross Heaney <rheane...@gmail.com> wrote:
> The witness does not store information for every element in the entire
> universe. Instead, it only needs to store a bit for each element that
> passes the Bloom filter test.
I think this has a major flaw, in that it is nearly impossible to guarantee zero false positives in practice using the scheme you describe.

Assuming 1 bit for each element that passes the main Bloom filter test, the witness list needs to be sized to O(universe * fp_rate) to accommodate a bit for every passed element (with some margin in either direction to account for variability in the practical Bloom filter false positive rate). When you store 32-bit integers, this would be ~2^22 bits for a false positive rate of 1/1024. Storing 64-bit integers at the same fp_rate would require ~2^54 bits in this witness list. The Bloom filter itself only contains N elements, and its size is proportional to -log(fp_rate) * N, which for any complex data type is orders of magnitude smaller than the witness list. The Bloom filter is orders of magnitude smaller because the witness list only works when we can assume there is no pigeonhole issue with the values stored, i.e. when we only look at small integer types, or at hashes where we don't mind hash collisions and the related false positives.

It is quite possible (guaranteed, even!) that two different strings hash to exactly the same values when using a fixed-size hash output, like the hash methods used in PostgreSQL. A witness list that uses your described method to guarantee no false positives even for strings would have to be sized on the order of 2^(8 * 2^30) bits, give or take a few zeroes, to account for all possible string values in PostgreSQL. That just isn't practical.

Actually, after looking at your 0001 patch, it looks like your 'witness' is not a list of both true and false positives, but a hashmap of only the true positive values, i.e. the list of inserted values. That does indeed make the combined data structure mostly "lossless", but:

1) this adds O(n_inserted) overhead with a large constant multiplier on n_inserted, as hash maps have about 24 B / 192 bits of overhead per entry, which is probably an order of magnitude more than the approximately O(-log(p)) bits per element of the Bloom filter itself; and
2) this won't solve the pigeonhole issue with trying to use a Bloom filter on strings or other data types that might have more than the kept 64 bits of entropy.

So, I'm sorry, but I don't think this is a lossless Bloom filter.

Kind regards,

Matthias van de Meent
Databricks
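
PS: For what it's worth, here is a rough back-of-envelope sketch of the sizes I used above. The parameters (p = 1/1024, n = 10^6 inserted elements) are assumptions I picked for illustration, not numbers taken from your patch:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double p = 1.0 / 1024;   /* assumed false positive rate */
    double n = 1e6;          /* assumed number of inserted elements */

    /*
     * A witness bitmap with one bit for every value that could pass the
     * filter has to cover roughly |universe| * p bits.
     */
    double witness32 = pow(2, 32) * p;   /* ~2^22 bits for a 32-bit type */
    double witness64 = pow(2, 64) * p;   /* ~2^54 bits for a 64-bit type */

    /* A well-sized Bloom filter needs about 1.44 * log2(1/p) bits/element. */
    double bloom_per_elem = 1.44 * log2(1.0 / p);

    printf("witness bitmap, 32-bit universe: 2^%.0f bits\n", log2(witness32));
    printf("witness bitmap, 64-bit universe: 2^%.0f bits\n", log2(witness64));
    printf("bloom filter: %.1f bits/element, 2^%.1f bits total for n = %.0f\n",
           bloom_per_elem, log2(bloom_per_elem * n), n);
    return 0;
}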
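
With those assumed numbers the filter costs ~14.4 bits per inserted element, while a hash map of the true positives costs on the order of 192 bits of overhead per entry before counting the keys themselves, which is where the order-of-magnitude difference in point 1) comes from.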