On Mon, May 1, 2017 at 12:46 PM, Robert Haas <robertmh...@gmail.com> wrote:
> Bloom filters are one of those things that come up on this mailing
> list incredibly frequently but rarely get used in committed code; thus
> far, contrib/bloom is the only example we've got, and not for lack of
> other proposals.
They certainly are a fashionable data structure, but it's not as if
they're a new idea. The math behind them is very well understood. They
solve one narrow class of problem very well.

> One problem is that Bloom filters assume you can get n independent
> hash functions for a given value, which we have not got. That problem
> would need to be solved somehow. If you only have one hash function,
> the size of the required bloom filter probably gets very large.

I don't think that that's a problem, because you really only need 2
hash functions [1], which we have already (recall that Andres added
Murmur hash to Postgres 10). It's an area that I'd certainly need to
do more research on if I'm to go forward with Bloom filters, but I'm
pretty confident that there is a robust solution to the practical
problem of not having arbitrarily many hash functions at hand. (I
think that you rarely need all that many independent hash functions,
in any case.)

It isn't that hard to evaluate whether or not an implementation has
things right, at least for a variety of typical cases. We know how to
objectively evaluate a hash function while making only some pretty
mild assumptions.

> When hashing index and heap tuples, do you propose to include the
> heap TID in the data getting hashed? I think that would be a good
> idea, because otherwise you're only verifying that every heap tuple
> has an index pointer pointing at something, not that every heap tuple
> has an index tuple pointing at the right thing.

Yes -- I definitely want to hash the heap TID from each IndexTuple.

> I wonder if it's also worth having a zero-error mode, even if it runs
> for a long time. Scan the heap, and probe the index for the value
> computed from each heap tuple. Maybe that's so awful that nobody
> would ever use it, but I'm not sure. It might actually be simpler to
> implement than what you have in mind.

It's easy if you don't mind that the implementation will be an ad-hoc
nested loop join.
I guess I could do that too, if only because it won't be that hard, and
that's really what you want when you know you have corruption.
Performance will probably be prohibitively poor when verification needs
to be run in any kind of routine way, which is a problem if that's the
only way it can work. My sense is that verification needs to be
reasonably low overhead, and it needs to perform pretty consistently,
even if you only use it for stress-testing new features.

To reiterate, another thing that makes a Bloom filter attractive is how
it simplifies resource management relative to an approach involving
sorting or a hash table. There are a bunch of edge cases that I don't
have to worry about around resource management (e.g., a subset of very
wide outlier IndexTuples, or two indexes of very different sizes
associated with the same table that each need to receive a fair share
of memory). As I said, even if I were totally willing to duplicate the
effort that went into respecting work_mem as a budget within places
like tuplesort.c, having as little infrastructure code as possible is a
specific goal for amcheck.

[1] https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

--
Peter Geoghegan
VMware vCenter Server
https://www.vmware.com/