On Fri, Jan 8, 2010 at 7:03 AM, Grant Ingersoll <[email protected]> wrote:
> On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:
>
>> The pieces are lying around.
>>
>> I had a framework like this for recs and text analysis at Veoh, and Jake
>> has something at LinkedIn.
>>
>> But the amount of code is relatively small and could probably be rewritten
>> before Jake can get clearance to release anything.
>>
>> The first step is to just count n-grams. I think the input should be
>> relatively flexible, and if you assume parametrized use of Lucene
>> analyzers, then all that is necessary is a small step up from word
>> counting.
>
> The classification stuff has this already, in MR form, independent of
> Lucene.
>
>> This should count all n-grams from 0 up to a limit. It should also allow
>> suppression of output of any counts below a threshold. The total number
>> of n-grams of each size observed should be accumulated.
>
> I believe it does this, too. Robin?

Yeah, brute-force n-gram generation is done by the Bayes classifier. Beware: it is practically a combinatorial explosion of data, but enough machines can tame it well.

Take a look at the DictionaryVectorizer. If an LLR job could be added to the chain, I could use that information while creating vectors: https://issues.apache.org/jira/browse/MAHOUT-237

I like the formulation Drew made, using (n-1)-grams to generate n-grams. It is the same approach I used to generate n-grams at http://thinking.me/ (Himanshu and I built it when I was still in college), but that was just a PHP script iterating over a sample of Twitter data :).

One interesting thing I found is that any n-gram with LLR < 1 is practically junk, and anything with LLR > 50 is pretty awesome; between 1 and 50 it is always debatable. This held approximately true for both large and small datasets.

I would be really happy if Drew could work on the LLR-based bigram generation code and help me attach it to the rest of the DictionaryVectorizer. Also, I would prefer that the entire Mahout codebase agree on a single format for document input.
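For anyone following along, here is a minimal sketch of that (n-1)-gram formulation in Python (illustrative only, not the Mahout or PHP code): each n-gram is built by joining adjacent, overlapping (n-1)-grams, and the payoff is that low-count (n-1)-grams can be pruned before extension, which tames the combinatorial blow-up.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def extend(prev_grams):
    """Build n-grams from a list of (n-1)-grams in sequence order:
    each (n-1)-gram is joined with its right neighbour, which shares
    an (n-2)-token overlap and contributes its last token."""
    return [a + (b[-1],) for a, b in zip(prev_grams, prev_grams[1:])]

tokens = "counting n-grams is a small step up from word counting".split()
bigrams = ngrams(tokens, 2)
trigrams = extend(bigrams)  # same result as ngrams(tokens, 3)
```

In the pruned variant, you would filter `bigrams` by count or LLR before calling `extend`, so junk (n-1)-grams never spawn n-grams.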
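For reference, the LLR in those thresholds is Dunning's G² statistic over a 2x2 contingency table; a Python port of that formulation (my sketch, computing the same statistic as Mahout's LLR code, not taken from it) looks like:

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy term used in the G^2 statistic.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a bigram (A, B):
    k11 = count of A followed by B,
    k12 = count of A followed by something other than B,
    k21 = count of B preceded by something other than A,
    k22 = count of everything else."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Independent words score ~0; strongly associated words score high.
print(llr(1, 1, 1, 1))    # ~0.0 -> junk by the rule of thumb above
print(llr(10, 0, 0, 10))  # ~27.7 -> in the debatable 1-50 band
```

The counts would come from the n-gram counting job, so this drops naturally into the chain after counting.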
I would suggest we stick to SequenceFiles, with the key as the docid and the value as the document content. That way, we leave the creation of the sequence files to the user.

Robin
