Re: BloomFilter-s with Lucene

eks dev Fri, 30 Jan 2009 13:19:15 -0800

I have used them for speeding up huge switch clauses in charset normalization 
(eg lowercase and accent->plain form mapping). Big number of accented 
characters (this causes big switch statement) that appear seldom in corpus (big 
majority being not accented). If negative test, you do just simple array 
access, if positive do full work with hige switch statement.


       



----- Original Message ----
> From: Andrzej Bialecki <[email protected]>
> To: [email protected]
> Sent: Friday, 30 January, 2009 21:42:13
> Subject: Re: BloomFilter-s with Lucene
> 
> markharw00d wrote:
> > Andrzej Bialecki wrote:
> > 
> > Funny, I was having vague thoughts about this today too having been 
> > concerned 
> about some of the big arrays that can end up in a typical Lucene app. Aside 
> from 
> providing space-efiicient lookups, another application for BloomFilters is in 
> similarity measures e.g. ANDing 2 BloomFilters can provide the basis for a 
> rough 
> measure of similarity.
> 
> Yes, it's an intriguing thought - worth checking how well it works in 
> practice.
> 
> There are some other cool things you can do with BloomFilter-s:
> 
> * a counting BloomFilter allows both adding and deleting the keys, and the 
> estimation of frequency of a key (if the same key is added multiple times).
> 
> * spectral BloomFilters provide a way to store a (small) exact value within a 
> filter, so that when you test the membership you can also retrieve the value 
> corresponding to that key (perhaps good for idf?). This could be potentially 
> useful in many places, but one specific application is outlined here: 
> http://markmail.org/message/xyqdz3go6jwu4ozm
> 
> * if you use BloomFilter just for quick lookup, but you want to eliminate 
> false 
> positives, it's possible to probe the main filter using all keys from the 
> keyspace, and build a small table of exceptions that eliminates false 
> positives 
> from the first filter.
> 
> * dynamic BloomFilter-s don't require that you know the number of keys in 
> advance. This could be useful e.g. to quickly eliminate nonexistent terms 
> from a 
> query, without actually looking up the query terms in the dictionary - just 
> maintain a dynamic BloomFilter of all terms in the index.
> 
> * another thought: we could create a BloomFilter from all terms in a 
> document, 
> and store it as a field in a document. I don't know yet what for ... :) well, 
> are there any scenarios when you test the document whether it contains a list 
> of 
> words? perhaps spam/porn detection.
> 
> Keep the ideas coming ... I have a feeling that eventually something useful 
> will 
> come out of it.
> 
> -- Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]





---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: BloomFilter-s with Lucene

Reply via email to