Re: BloomFilter-s with Lucene

pdecrem Fri, 30 Jan 2009 12:58:28 -0800

Well.  I used 2 Broder similarity measures, and it works well.  You obviously 
need to pick the right size bf's.

Navendu Jain has a paper called using bloomfilters to refine web search 
results, which I think is relevant here. I talks about how remove near 
duplicate search results using bf's.

-----Original Message-----
From: Andrzej Bialecki <a...@getopt.org>

Date: Fri, 30 Jan 2009 21:42:13 
To: <java-dev@lucene.apache.org>
Subject: Re: BloomFilter-s with Lucene

markharw00d wrote:
> Andrzej Bialecki wrote:
> 
> Funny, I was having vague thoughts about this today too having been 
> concerned about some of the big arrays that can end up in a typical 
> Lucene app. Aside from providing space-efiicient lookups, another 
> application for BloomFilters is in similarity measures e.g. ANDing 2 
> BloomFilters can provide the basis for a rough measure of similarity.

Yes, it's an intriguing thought - worth checking how well it works in 
practice.

There are some other cool things you can do with BloomFilter-s:

* a counting BloomFilter allows both adding and deleting the keys, and 
the estimation of frequency of a key (if the same key is added multiple 
times).

* spectral BloomFilters provide a way to store a (small) exact value 
within a filter, so that when you test the membership you can also 
retrieve the value corresponding to that key (perhaps good for idf?). 
This could be potentially useful in many places, but one specific 
application is outlined here: http://markmail.org/message/xyqdz3go6jwu4ozm

* if you use BloomFilter just for quick lookup, but you want to 
eliminate false positives, it's possible to probe the main filter using 
all keys from the keyspace, and build a small table of exceptions that 
eliminates false positives from the first filter.

* dynamic BloomFilter-s don't require that you know the number of keys 
in advance. This could be useful e.g. to quickly eliminate nonexistent 
terms from a query, without actually looking up the query terms in the 
dictionary - just maintain a dynamic BloomFilter of all terms in the index.

* another thought: we could create a BloomFilter from all terms in a 
document, and store it as a field in a document. I don't know yet what 
for ... :) well, are there any scenarios when you test the document 
whether it contains a list of words? perhaps spam/porn detection.

Keep the ideas coming ... I have a feeling that eventually something 
useful will come out of it.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: BloomFilter-s with Lucene

Reply via email to