Jake,

Thanks for mentioning this approach. The
ShingleFilter/ShingleAnalyzerWrapper is pretty handy and I'd never
used it before.

Is there a bloom filter implementation somewhere in Mahout or
elsewhere in the Lucene ecosystem?

Drew

On Wed, Jan 6, 2010 at 8:41 PM, Jake Mannix <[email protected]> wrote:

> The way I've done this is to take whatever unigram analyzer fits the
> tokenization you want to do, wrap it in Lucene's ShingleAnalyzer, and use
> that as the "tokenizer" (which now produces ngrams as single tokens each).
> Run that through the LLR ngram M/R job (which ends by sorting descending
> by LLR score), and shove the top-K ngrams (and sometimes the unigrams
> which fit some "good" IDF range) into a big bloom filter, which is
> serialized and saved.
>
> With that, to produce vectors you take the same ShingleAnalyzer you used
> previously, run the ngram token stream, and check each emitted token to
> see if it is in the bloom filter; if not, discard it.  If it is, you can
> hash it (or multiply-hash it) to get the ngram id for that token.  Of
> course, that doesn't properly normalize the columns of your term-document
> matrix (you don't have your IDF factors), but you can do that as a
> post-processing step after this one.
>
>  -jake
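
For concreteness, here is a minimal sketch of the first step Jake
describes: wrapping a unigram analyzer in ShingleAnalyzerWrapper so that
ngrams come out of the token stream as single tokens. The field name, the
max shingle size, and the Lucene 3.0-era TermAttribute API are
assumptions, and the LLR-scoring M/R job itself is elided.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class ShingleDemo {
      public static void main(String[] args) throws Exception {
        // Any unigram analyzer works here; WhitespaceAnalyzer is a placeholder.
        Analyzer unigrams = new WhitespaceAnalyzer();
        // Wrap it so the stream emits ngrams (up to trigrams here) as single tokens.
        ShingleAnalyzerWrapper shingles = new ShingleAnalyzerWrapper(unigrams, 3);
        shingles.setOutputUnigrams(true); // keep the unigrams alongside the ngrams

        TokenStream ts =
            shingles.tokenStream("text", new StringReader("the quick brown fox"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
          // Emits e.g. "the", "the quick", "quick", "quick brown", ...
          System.out.println(term.term());
        }
        ts.close();
      }
    }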
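
And a sketch of the bloom-filter half, using the BloomFilter that ships
with Hadoop (org.apache.hadoop.util.bloom) as one candidate implementation
for Drew's question. The filter sizing, the topNgrams input, the fixed
numFeatures dimensionality, and the single-hash feature ids are all
assumptions, and as Jake notes, IDF weighting is deferred to a later pass.

    import java.io.StringReader;
    import java.util.List;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;
    import org.apache.hadoop.util.hash.MurmurHash;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class BloomNgramVectorizer {
      private final BloomFilter goodNgrams;
      private final Analyzer shingleAnalyzer; // the same wrapped analyzer as above
      private final int numFeatures;          // output vector dimensionality

      public BloomNgramVectorizer(List<String> topNgrams, Analyzer shingleAnalyzer,
                                  int numFeatures) {
        this.shingleAnalyzer = shingleAnalyzer;
        this.numFeatures = numFeatures;
        // Bit-vector size and hash count are arbitrary here; size them to your
        // top-K count and false-positive budget.
        this.goodNgrams = new BloomFilter(8 * 1024 * 1024, 5, Hash.MURMUR_HASH);
        for (String ngram : topNgrams) {
          goodNgrams.add(new Key(ngram.getBytes()));
        }
      }

      /** Re-tokenize a document, drop tokens that miss the filter, hash the rest. */
      public double[] vectorize(String text) throws Exception {
        double[] vector = new double[numFeatures];
        TokenStream ts = shingleAnalyzer.tokenStream("text", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
          byte[] bytes = term.term().getBytes();
          if (goodNgrams.membershipTest(new Key(bytes))) {
            // Hash the surviving ngram down to a feature id for the vector.
            int hash = MurmurHash.getInstance().hash(bytes, bytes.length, 0);
            vector[(hash & Integer.MAX_VALUE) % numFeatures] += 1.0; // raw counts; IDF later
          }
        }
        ts.close();
        return vector;
      }
    }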
