On 1 May 2017 at 20:46, Robert Haas <robertmh...@gmail.com> wrote:
>  One problem is that Bloom filters assume you can get
> n independent hash functions for a given value, which we have not got.
> That problem would need to be solved somehow.  If you only have one
> hash function, the size of the required bloom filter probably gets
> very large.

There's a simple formula to calculate the optimal number of hash
functions and size of the filter given a target false positive rate.
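For concreteness, here's a small sketch of that standard sizing formula (m = -n·ln p / (ln 2)², k = (m/n)·ln 2); the function name is just illustrative, not anything in the Postgres tree:

```python
import math

def bloom_parameters(n, p):
    """Optimal bit-array size m and hash-function count k for a Bloom
    filter expected to hold n items at target false-positive rate p."""
    # m = -n * ln(p) / (ln 2)^2 : the standard Bloom filter sizing formula
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2 : the hash-function count that minimizes p
    k = round((m / n) * math.log(2))
    return m, k
```

E.g. a million items at a 1% false-positive target works out to roughly 9.6 Mbits (about 1.2 MB) and 7 hash functions.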

But I don't think this is as big a problem as you imagine.

a) We don't really have only one hash function; we have a 32-bit hash
function, and we could expand it to a larger output size if we wanted.
Bloom filters are never 2^32-bit arrays, for obvious reasons. A
1-kbit Bloom filter needs only 10 bits per index, so a single 32-bit
hash value already yields three fully independent indexes. If we
changed to a 64-bit or 128-bit hash function, we would have enough
bits for both a larger set of hash functions and a larger array.
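The bit-slicing in point (a) looks something like this (names are illustrative; the filter size must be a power of two for the masking to work):

```python
def indexes_from_hash(h32, filter_bits=1024, k=3):
    """Carve k independent indexes out of one 32-bit hash value.
    With a 2^10 = 1024-bit filter, each index consumes 10 bits, so a
    32-bit hash yields three of them with bits to spare."""
    bits_per_index = filter_bits.bit_length() - 1  # log2 of filter size
    assert k * bits_per_index <= 32, "not enough hash bits for k indexes"
    # take successive bits_per_index-wide slices of the hash
    return [(h32 >> (i * bits_per_index)) & (filter_bits - 1)
            for i in range(k)]
```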

b) You can get a poor man's universal hash out of hash_any or hash_int
by tweaking the input value in a way that doesn't interact in a
simple way with the hash function. Even something as simple as XORing
it with a random number (i.e. a vector of random numbers that identify
your randomly chosen distinct "hash functions") seems to work fine.
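A sketch of that XOR-seed trick, with a stand-in integer mixer where Postgres would use hash_int (the mixer and names here are illustrative, not the actual hash_any/hash_int code):

```python
import random

# One fixed random seed per derived "hash function"
SEEDS = [random.getrandbits(32) for _ in range(3)]

def base_hash(x):
    """Stand-in for hash_int: a decent 32-bit integer mixer."""
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF
    return x ^ (x >> 16)

def hash_family(x, i):
    """The i-th derived hash: perturb the input with seed i, then hash."""
    return base_hash(x ^ SEEDS[i])
```

The XOR perturbs the input in a way the mixer can't "undo" in any simple pattern, which is why the derived functions behave roughly independently in practice.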

However, for security hardening and future-proofing, I think Postgres
should really implement a mathematically rigorous universal hashing
scheme, which provides a family of hash functions to pick from at
random. That prevents users from, for example, crafting data that
intentionally performs poorly in hash-based data structures. It also
gives you a whole family of hash functions to choose from for Bloom
filters.
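To make "mathematically rigorous" concrete, here's the textbook Carter-Wegman construction h_{a,b}(x) = ((a*x + b) mod p) mod m, with a and b drawn at random per table or filter; this is a generic sketch of the technique, not a proposal for the actual Postgres implementation:

```python
import random

# A prime comfortably larger than any 32-bit key universe
P = (1 << 61) - 1  # a Mersenne prime

def make_hash(m, rng=random):
    """Draw one member of the universal family mapping ints into [0, m)."""
    a = rng.randrange(1, P)   # a must be nonzero
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m
```

Because a and b are chosen at runtime, an adversary can't precompute a key set that collides, which is exactly the hardening property above.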

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers