Drew - check out Hadoop, I believe there are a few Bloom filter implementations 
there.
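For reference, Hadoop ships org.apache.hadoop.util.bloom.BloomFilter, which
is a Writable. A minimal sketch of building and serializing one (the bit
vector size, hash count, and example ngrams below are placeholder values):

import java.io.DataOutputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BuildNGramBloom {
  public static void main(String[] args) throws Exception {
    // 8M-bit vector, 5 hash functions, murmur hashing -- tune for your ngram count
    BloomFilter filter = new BloomFilter(8 * 1024 * 1024, 5, Hash.MURMUR_HASH);

    // in practice these would be the top-K ngrams out of the LLR job
    for (String ngram : new String[] {"new york", "bloom filter"}) {
      filter.add(new Key(ngram.getBytes("UTF-8")));
    }

    // BloomFilter is a Writable, so it serializes like any other Hadoop type
    DataOutputStream out =
        new DataOutputStream(new FileOutputStream("ngrams.bloom"));
    filter.write(out);
    out.close();
  }
}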
 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Drew Farris <[email protected]>
> To: [email protected]
> Sent: Wed, January 6, 2010 10:23:52 PM
> Subject: Re: n-grams for terms?
> 
> Jake,
> 
> Thanks for mentioning this approach. The
> ShingleFilter/ShingleAnalyzerWrapper is pretty handy and I'd never
> used it before.
> 
> Is there a bloom filter implementation somewhere in Mahout or
> elsewhere in the Lucene ecosystem?
> 
> Drew
> 
> On Wed, Jan 6, 2010 at 8:41 PM, Jake Mannix wrote:
> 
> > The way I've done this is to take whatever unigram analyzer for
> > tokenization fits what you want to do, wrap it in Lucene's
> > ShingleAnalyzerWrapper, and use that as the "tokenizer" (which now
> > emits each ngram as a single token), run that through the LLR ngram
> > M/R job (which ends by sorting descending by LLR score), and shove
> > the top-K ngrams (and sometimes the unigrams which fit some "good"
> > IDF range) into a big bloom filter, which is serialized and saved.
> >
> > With that, you can take the original ShingleAnalyzerWrapper you used
> > previously and, to produce vectors, take the ngram token stream
> > output and check each emitted token to see if it is in the bloom
> > filter; if not, discard it.  If it is, you can hash it (or
> > multi-hash it) to get the ngram id for that token.  Of course, that
> > doesn't properly normalize the columns of your term-document matrix
> > (you don't have your IDF factors), but you can do that as a
> > post-processing step after this one.
> >
> >  -jake

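And for anyone wanting to wire up the vectorization step Jake describes
above: a rough sketch, assuming Lucene's contrib ShingleAnalyzerWrapper and
Hadoop's bloom classes (the field name, shingle size, and simple modulo
hashing below are placeholder choices, not necessarily what Jake's code does):

import java.io.StringReader;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class NGramVectorizerSketch {
  // For each shingle the analyzer emits, keep it only if the bloom filter
  // (built from the top-K LLR ngrams) says it may be present, then hash it
  // down to a feature id.
  public static void vectorize(String text, BloomFilter keepers,
      int numFeatures) throws Exception {
    Analyzer shingles =
        new ShingleAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30), 3);
    TokenStream ts = shingles.tokenStream("text", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      String token = term.term();
      if (keepers.membershipTest(new Key(token.getBytes("UTF-8")))) {
        // surviving ngram: hash to an id (multi-hashing would cut collisions)
        int id = (token.hashCode() & Integer.MAX_VALUE) % numFeatures;
        System.out.println(token + " -> " + id);
      }
      // tokens not in the filter are simply discarded
    }
    ts.close();
  }
}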