Hi Bogdan,

Sorry about that - I kinda jumped into implementation details related to what we've been discussing on the mahout-dev list. Much of what I described we don't yet have in mahout, and I was talking about some of the ideas on how we can incorporate it!
On Wed, Jan 6, 2010 at 5:53 PM, Bogdan Vatkov <[email protected]> wrote:

> Ok, to be honest I do not get all of this - still a newbie in this matter.
> Have some clarification questions:
>
> > The way I've done this is to take whatever unigram analyzer for
> > tokenization that fits what you want to do, wrap it in Lucene's
> > ShingleAnalyzer, and use that as the "tokenizer" (which now
> > produces ngram tokens as single tokens each),
>
> you mean to use the unigram analyzer only to feed the ShingleAnalyzer (as
> part of lucene?), so the first one is not natively connected to lucene
> directly but only produces input for lucene so to say?

Mahout has lucene jars pulled in, and we use them for things like
analyzing / tokenizing text. For example, included in the lucene jars we
link to is the StandardAnalyzer, which does tokenization, lower-casing,
stop word removal, and some cleanup. It takes in your text and spits out a
stream of tokens, which for most Analyzer implementations are unigrams.
Also available in lucene is the ShingleAnalyzer, and the Analyzer class
itself allows for chaining, so you can do things like

  Analyzer myAnalyzer = new ShingleAnalyzer(3, new StandardAnalyzer());

(where "3" just says you only want up to 3-grams). What you have at this
point is an Analyzer instance which takes in raw Reader instances (like
StringReader) and spits out a stream of ngrams.

> > and run that through the LLR ngram M/R job (which ends by sorting
> > descending by LLR score), and shove the top-K ngrams (and sometimes
> > the unigrams which fit some "good" IDF range) into a big bloom filter,
> > which is serialized and saved.
>
> what is IDF and LLR?

IDF is "Inverse Document Frequency", a weighting scheme used in lucene and
elsewhere in IR and NLP which boosts the effective weight of terms which
are more rare (numerically, the IDF of a term which occurs in k out of N
documents is usually taken to be something logarithmic, like log(N / k)).

LLR is the "Log-Likelihood Ratio", which for token n-grams is a measure of
how statistically significant each n-gram is (i.e. if tokens "A" and "B"
occur with frequencies f(A) and f(B), the LLR of the bigram "A B" measures
how much more frequently "A B" occurs than it would if "A" and "B" were
uncorrelated). In some ways you can look at IDF as LLR for unigrams.

> do we have that in Mahout? I am only using the k-means algorithm from
> Mahout. How would you relate these? I lack some terminology though.

What I described is a pipeline: take text, tokenize it, find the top
statistically significant ngrams (you don't want all of them - most are
either fairly rare, or are just the combination of a common unigram and
another unigram), and then make Mahout SparseVector instances out of the
documents, which you can then feed to k-means.

> and I could possibly do that one - write some small algorithm which finds
> n-grams and feeds the initial term vectors before the k-means task.
> Is what you explained close to what Ted suggested or are they different
> approaches?

I was basically describing what Ted was talking about, just in a bit more
intricate detail. Wrap a simple ShingleAnalyzer around a StandardAnalyzer
from lucene and you've got your "little algorithm" to pull out ngrams
which can be fed to k-means. You will just get a lot of noise if you don't
do a little bit of work to make sure you only keep the "good" ngrams.
(I've put a few rough sketches of these pieces below.)
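Since the analyzer chaining is the heart of the "little algorithm", here is a
rough, self-contained sketch of pulling the ngram tokens back out of such an
analyzer. A couple of caveats: the class in lucene's contrib analyzers is
actually called ShingleAnalyzerWrapper (and takes the analyzer first and the
max shingle size second), and the TokenStream attribute API shifts a bit
between lucene versions, so treat this as approximate rather than something
to copy-paste:

  import java.io.StringReader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.util.Version;

  public class NGramTokenDump {
    public static void main(String[] args) throws Exception {
      // StandardAnalyzer emits unigrams; the shingle wrapper glues them
      // together into shingles of up to 3 tokens ("up to 3-grams").
      Analyzer analyzer = new ShingleAnalyzerWrapper(
          new StandardAnalyzer(Version.LUCENE_CURRENT), 3);

      TokenStream stream = analyzer.tokenStream("text",
          new StringReader("apache mahout clusters large document collections"));
      TermAttribute term = stream.addAttribute(TermAttribute.class);

      // unigrams, bigrams and trigrams all come out as single tokens:
      // "apache", "apache mahout", "apache mahout clusters", "mahout", ...
      while (stream.incrementToken()) {
        System.out.println(term.term());
      }
    }
  }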
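And if it helps to see what the LLR score actually computes, here's a tiny
self-contained version of Ted's G^2 / log-likelihood ratio for a bigram
"A B" viewed as a 2x2 contingency table (the counts in main() are made-up
numbers, just for illustration):

  // Scores how "surprising" the bigram "A B" is, given how often A and B
  // occur on their own. Bigger score == more statistically significant.
  public final class Llr {

    // x * log(x), with the convention 0 * log(0) == 0
    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    // un-normalized Shannon entropy of a set of counts, scaled by their sum
    private static double entropy(long... counts) {
      long sum = 0;
      double sumXLogX = 0.0;
      for (long k : counts) {
        sum += k;
        sumXLogX += xLogX(k);
      }
      return xLogX(sum) - sumXLogX;
    }

    // k11 = # of bigram windows that are exactly "A B"
    // k12 = # of windows starting with A but not followed by B
    // k21 = # of windows ending in B but not preceded by A
    // k22 = # of windows containing neither
    public static double logLikelihoodRatio(long k11, long k12,
                                            long k21, long k22) {
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double colEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      return 2.0 * (rowEntropy + colEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
      // made-up counts: "new york" 100 times, "new <other>" 900 times,
      // "<other> york" 50 times, out of 100,000 bigram windows total
      System.out.println(logLikelihoodRatio(100, 900, 50, 98950));
    }
  }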
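Finally, turning a document's counts of "good" ngrams into something k-means
can eat looks roughly like this (the SparseVector package name below is my
best recollection of where it lives right now - it has been moving around -
so check trunk before trusting the import):

  import java.util.Map;

  import org.apache.mahout.matrix.SparseVector;
  import org.apache.mahout.matrix.Vector;

  public class NGramVectorizer {

    /**
     * ngramCounts: how many times each ngram occurred in this document
     * dictionary:  the "good" ngrams (e.g. the top-K by LLR), each mapped
     *              to a fixed index in the vector
     */
    public static Vector vectorize(Map<String, Long> ngramCounts,
                                   Map<String, Integer> dictionary) {
      Vector vector = new SparseVector(dictionary.size());
      for (Map.Entry<String, Long> e : ngramCounts.entrySet()) {
        Integer index = dictionary.get(e.getKey());
        if (index != null) {               // drop ngrams that didn't make the cut
          vector.set(index, e.getValue()); // raw term frequency; you could TF-IDF weight here
        }
      }
      return vector;
    }
  }

You'd then feed those vectors to k-means exactly the way you feed it your
current term vectors.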
I would really like to have this whole pipeline in Mahout, in a totally user-friendly way, but while the pieces exist, the whole pipeline hasn't been put together quite yet (although you could certainly write it yourself - we love patches! :) ).

  -jake
