OK, to be honest I do not get all of this - I am still a newbie in this area.
I have some clarification questions:

> The way I've done this is to take whatever unigram analyzer for tokenization
> that fits what you want to do, wrap it in Lucene's ShingleAnalyzer, and use
> that as the "tokenizer" (which now produces ngram tokens as single tokens
> each),


You mean to use the unigram analyzer only to feed the ShingleAnalyzer (which
is part of Lucene?), so the first one is not natively connected to Lucene
directly but only produces input for it, so to speak?
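
Just to check my understanding: is it roughly something like this? (Only a
sketch, assuming Lucene 3.x and StandardAnalyzer as the unigram analyzer.)

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
  public static void main(String[] args) throws Exception {
    // Any unigram analyzer goes inside; the wrapper then emits word
    // n-grams ("shingles") as single tokens.
    Analyzer unigrams = new StandardAnalyzer(Version.LUCENE_30);
    Analyzer shingles = new ShingleAnalyzerWrapper(unigrams, 3); // up to trigrams

    TokenStream ts = shingles.tokenStream("text",
        new StringReader("quick brown fox jumps"));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.term()); // "quick", "quick brown", "quick brown fox", "brown", ...
    }
  }
}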


> and run that through the LLR ngram M/R job (which ends by sorting
> descending by LLR score), and shove the top-K ngrams (and sometimes the
> unigrams which fit some "good" IDF range) into a big bloom filter, which is
> serialized and saved.

What are IDF and LLR, and do we have those in Mahout? I am only using the
k-means algorithm from Mahout. How would you relate these? I lack some of the
terminology, though.
And what is a bloom filter?
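
From a quick look it seems mahout-math has an LLR helper class; is the idea
something like this? (Made-up counts, and I am not sure LogLikelihood is the
exact class the ngram job uses.)

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrIdfDemo {
  public static void main(String[] args) {
    // LLR scores how "surprising" it is that two words occur together,
    // from a 2x2 contingency table of corpus counts (all numbers invented):
    long k11 = 110;   // "new" and "york" together
    long k12 = 2442;  // "new" without "york"
    long k21 = 111;   // "york" without "new"
    long k22 = 29114; // neither word
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("LLR(new york) = " + llr);

    // IDF just down-weights terms that appear in many documents:
    // idf = log(numDocs / docFreq), e.g. 1000 docs, term in 42 of them.
    double idf = Math.log(1000.0 / 42.0);
    System.out.println("idf = " + idf);
  }
}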


> With that, you can take that original ShingleAnalyzer you used previously,
> and to produce vectors, you take the ngram token stream output and check
> each emitted token to see if it is in the bloom filter; if not, discard.
> If it is, you can hash (or multiply hash) it to get the ngram id for that
> token.  Of course, that doesn't properly normalize the columns of your
> term-document matrix (you don't have your IDF factors), but you can do that
> as a post-processing step after this one.


I will need some more time to start understanding this part ;)
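
If I read it right, though, the per-document vectorization would be roughly
the following. (Only my sketch: a plain Set stands in for the serialized
bloom filter, I reuse the Lucene 3.x shingle analyzer from above, and the
"hashing" is just hashCode() modulo the number of features.)

import java.io.StringReader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NgramVectorizer {

  /** Turn one document into a hashed term-frequency vector, keeping only "good" ngrams. */
  static Vector vectorize(String text, Set<String> goodNgrams, int numFeatures)
      throws Exception {
    // In the real pipeline goodNgrams would be the bloom filter of top-K
    // LLR ngrams; a plain Set is just a stand-in here.
    Analyzer shingles =
        new ShingleAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30), 3);
    Vector vec = new RandomAccessSparseVector(numFeatures);

    TokenStream ts = shingles.tokenStream("text", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      String ngram = term.term();
      if (!goodNgrams.contains(ngram)) {
        continue;                                            // not in the filter -> discard
      }
      int id = (ngram.hashCode() & Integer.MAX_VALUE) % numFeatures;  // hash -> ngram id
      vec.set(id, vec.get(id) + 1.0);                         // raw tf; IDF applied later
    }
    return vec;
  }
}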

I somehow understood better what Ted explained earlier:

> If you can get bigrams or trigrams indexed as single terms then the k-means
> clustering should work just fine

and I could possibly do that one - write some small algorithm which finds
n-grams and feeds the initial term vectors before the k-means task (something
like the sketch below).
Is what you explained close to what Ted suggested, or are they different
approaches?
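
For the n-gram-finding part I was imagining something as simple as this (toy
code, whitespace tokenization only; the counts would then be mapped through a
dictionary into the term vectors for k-means):

import java.util.HashMap;
import java.util.Map;

public class SimpleBigrams {

  /** Count the bigrams of a whitespace-tokenized document as single "terms". */
  static Map<String, Integer> bigramCounts(String text) {
    String[] toks = text.toLowerCase().split("\\s+");
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int i = 0; i + 1 < toks.length; i++) {
      String bigram = toks[i] + "_" + toks[i + 1];  // one token per bigram
      Integer c = counts.get(bigram);
      counts.put(bigram, c == null ? 1 : c + 1);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(bigramCounts("new york is not new jersey"));
    // e.g. {new_york=1, york_is=1, is_not=1, not_new=1, new_jersey=1}
  }
}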


-- 
Best regards,
Bogdan
