On Wed, Jan 6, 2010 at 2:26 PM, Ted Dunning <[email protected]> wrote:

>
> Grant, is there a Lucene analyzer that would do that?
>
>
The way I've done this is to take whatever unigram analyzer fits your tokenization
needs, wrap it in Lucene's ShingleAnalyzer, and use that as the "tokenizer" (which
now produces ngrams as single tokens each). Run that through the LLR ngram M/R job
(which ends by sorting descending by LLR score), and shove the top-K ngrams (and
sometimes the unigrams which fall into some "good" IDF range) into a big Bloom
filter, which is serialized and saved.
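
In code, the tokenization half of that looks roughly like the sketch below. It's a
minimal sketch, not the exact setup I use: I'm calling ShingleFilter directly instead
of going through ShingleAnalyzerWrapper, the "body" field name, shingle sizes, and
choice of StandardAnalyzer are placeholders for whatever unigram analyzer you pick,
and the TokenStream attribute API differs a bit between Lucene versions.

  import java.io.IOException;
  import java.io.StringReader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.shingle.ShingleFilter;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  /** Sketch: wrap a unigram analyzer so ngrams come out as single tokens. */
  public class ShingleTokenSketch {

    public static void printNgrams(String text) throws IOException {
      Analyzer unigramAnalyzer = new StandardAnalyzer();  // any unigram analyzer you like
      // ShingleFilter glues adjacent unigrams into ngram tokens ("machine learning", ...)
      ShingleFilter shingles = new ShingleFilter(
          unigramAnalyzer.tokenStream("body", new StringReader(text)), 2, 3);
      CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
      shingles.reset();
      while (shingles.incrementToken()) {
        System.out.println(term.toString());              // each ngram arrives as one token
      }
      shingles.end();
      shingles.close();
    }
  }

Each emitted string (unigrams plus bigrams and trigrams here, since ShingleFilter
passes unigrams through by default) is what the LLR counting job then sees as a
single "word".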

With that, to produce vectors you take the same ShingleAnalyzer you used previously,
run your text through it, and check each emitted ngram token against the Bloom
filter: if it isn't there, discard it; if it is, you can hash (or multiply-hash) it
to get the ngram id for that token.  Of course, that doesn't properly normalize the
columns of your term-document matrix (you don't have your IDF factors), but you can
do that as a post-processing step after this one.
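
On the vectorization side, the membership check plus hashing looks something like
this sketch, assuming the Bloom filter was built and serialized with Hadoop's
org.apache.hadoop.util.bloom classes; the feature count and the single modulo hash
below are my own simplifications (the "multiply hashing" variant would combine
several independent hashes), not exactly what the Mahout job does:

  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.util.bloom.BloomFilter;
  import org.apache.hadoop.util.bloom.Key;

  /** Sketch: keep only ngrams that survived the LLR cut, hash them to column ids. */
  public class NgramVectorizerSketch {

    private final BloomFilter goodNgrams;  // deserialized output of the LLR job
    private final int numFeatures;         // dimensionality of the document vectors

    public NgramVectorizerSketch(BloomFilter goodNgrams, int numFeatures) {
      this.goodNgrams = goodNgrams;
      this.numFeatures = numFeatures;
    }

    /** Returns a column index for the ngram token, or -1 if it should be discarded. */
    public int ngramId(String ngramToken) {
      byte[] bytes = ngramToken.getBytes(StandardCharsets.UTF_8);
      if (!goodNgrams.membershipTest(new Key(bytes))) {
        return -1;                         // not in the top-K LLR set: drop it
      }
      // hash the surviving token down to a column index
      return (ngramToken.hashCode() & Integer.MAX_VALUE) % numFeatures;
    }
  }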

  -jake


> On Wed, Jan 6, 2010 at 2:16 PM, Bogdan Vatkov <[email protected]> wrote:
>
> > In some of the posts I saw something about n-grams, but I am not sure how
> > I can get clustering with n-grams supported.
> > I am currently running only k-means (I picked it more or less randomly -
> > not sure which algorithm is best for my data) and I only get TopTerms as
> > unigrams - can I get some clustering based on bigrams, trigrams, n-grams?
> >
> > Another question I have is: which Mahout clustering algorithm is
> > recommended for a large number of relatively small documents? (As I said,
> > I use k-means more or less by accident - it is the first algorithm I could
> > run with my data - I was focused on providing stop-words & stop-regex
> > filtering for my input text vectors.)
> >
> > Best regards,
> > Bogdan
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
