On Mon, Jan 18, 2010 at 2:29 AM, Greg Stark <gsst...@mit.edu> wrote:
> Bloom filters need to be sized to have n bits per set element to
> maintain a targeted false positive rate. And that false positive rate
> would be higher than just having an n-bit hash for each set element.
> Normally they have the advantage that you can test for membership
> anywhere in the set quickly -- but in this case you actually only want
> to test a specific position in the set anyway.
>
> So I think you can get a lower false positive rate by just truncating
> each column's hash value to n bits and storing an array of them.
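To put rough numbers on that comparison, here's a quick standalone sketch
(not part of the patch; it just works out the usual optimal-hash-count
approximation for a Bloom filter, fpr ~= 2^-(bits/1.44), against 2^-bits
for a per-column hash truncated to the same width):

#include <stdio.h>
#include <math.h>

/*
 * Back-of-the-envelope false positive rates:
 *   - Bloom filter sized at "bits" bits per set element, with the optimal
 *     number of hash functions: fpr ~= 2^-(bits * ln 2) = 2^-(bits / 1.44)
 *   - Per-column hash truncated to "bits" bits: fpr = 2^-bits
 * Testing c independent columns multiplies the rates, i.e. fpr^c.
 */
int
main(void)
{
	int			widths[] = {2, 4, 8};
	int			i,
				cols;

	for (i = 0; i < (int) (sizeof(widths) / sizeof(widths[0])); i++)
	{
		double		bloom_fpr = pow(2.0, -widths[i] * log(2.0));
		double		trunc_fpr = pow(2.0, -widths[i]);

		for (cols = 1; cols <= 3; cols++)
			printf("%d bits/column, %d col(s): bloom ~ 1 in %.0f, truncated hash ~ 1 in %.0f\n",
				   widths[i], cols,
				   1.0 / pow(bloom_fpr, cols),
				   1.0 / pow(trunc_fpr, cols));
	}

	return 0;
}

Compiled with -lm, that prints roughly the 1-in-7 versus 1-in-16 figures
worked through below.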
So for example, if your bloom filter is 4 bits per column, your error rate
for a single-column where clause will be 1/2^(4/1.44) (with the optimal
number of hash functions), or a little worse than returning 1 record in 7.
If you test two or three columns then it'll be about 1 in 49 or 1 in 343.

If you just stored a nibble for each column then you could look up the
specific nibble for that column in a single-column where clause and get an
error rate of 1 in 16. If you test two or three columns then it would be
1 in 256 or 1 in 4,096.

If you're aiming at very large numbers of columns with very large where
clauses then perhaps you only need, say, 2 bits per column. But for any
number of bits per column I think you're always better off just storing
that many bits of each hash rather than using a bloom filter for this.

Also, as it stands now it's an unordered list. I assume you have no plans
to implement vacuum for this? Perhaps it would be better to store a btree
of signatures ordered by heap TID. That would let you easily vacuum dead
entries, though it would necessitate random seeks to do a full scan of the
index unless you can solve the problem of running a full index scan
concurrently with page splits.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers