Well. I used 2 Broder similarity measures, and it works well. You obviously need to pick the right size bf's.
Navendu Jain has a paper called using bloomfilters to refine web search results, which I think is relevant here. I talks about how remove near duplicate search results using bf's. -----Original Message----- From: Andrzej Bialecki <a...@getopt.org> Date: Fri, 30 Jan 2009 21:42:13 To: <java-dev@lucene.apache.org> Subject: Re: BloomFilter-s with Lucene markharw00d wrote: > Andrzej Bialecki wrote: > > Funny, I was having vague thoughts about this today too having been > concerned about some of the big arrays that can end up in a typical > Lucene app. Aside from providing space-efiicient lookups, another > application for BloomFilters is in similarity measures e.g. ANDing 2 > BloomFilters can provide the basis for a rough measure of similarity. Yes, it's an intriguing thought - worth checking how well it works in practice. There are some other cool things you can do with BloomFilter-s: * a counting BloomFilter allows both adding and deleting the keys, and the estimation of frequency of a key (if the same key is added multiple times). * spectral BloomFilters provide a way to store a (small) exact value within a filter, so that when you test the membership you can also retrieve the value corresponding to that key (perhaps good for idf?). This could be potentially useful in many places, but one specific application is outlined here: http://markmail.org/message/xyqdz3go6jwu4ozm * if you use BloomFilter just for quick lookup, but you want to eliminate false positives, it's possible to probe the main filter using all keys from the keyspace, and build a small table of exceptions that eliminates false positives from the first filter. * dynamic BloomFilter-s don't require that you know the number of keys in advance. This could be useful e.g. to quickly eliminate nonexistent terms from a query, without actually looking up the query terms in the dictionary - just maintain a dynamic BloomFilter of all terms in the index. * another thought: we could create a BloomFilter from all terms in a document, and store it as a field in a document. I don't know yet what for ... :) well, are there any scenarios when you test the document whether it contains a list of words? perhaps spam/porn detection. Keep the ideas coming ... I have a feeling that eventually something useful will come out of it. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org