On Jan 7, 2010, at 7:57 PM, Ted Dunning wrote:

> The pieces are laying around.
>
> I had a framework like this for recs and text analysis at Veoh, Jake has
> something in LinkedIn.
>
> But the amount of code is relatively small and probably could be rewritten
> before Jake can get clearance to release anything.
>
> The first step is to just count n-grams. I think that the input should be
> relatively flexible and if you assume parametrized use of Lucene analyzers,
> then all that is necessary is a small step up from word counting.

The classification stuff has this already, in MR form, independent of Lucene.

> This
> should count all n-grams from 0 up to a limit. It should also allow
> suppression of output of any counts less than a threshold. Total number of
> n-grams of each size observed should be accumulated.

I believe it does this, too. Robin?

> There should also be
> some provision for counting cooccurrence pairs within windows or between two
> fields.
>
> The second step is to detect interesting n-grams. This is done using the
> counts of words and (n-1)-grams and the relevant totals as input for the LLR
> code.
>
> The final (optional) step is creation of a Bloom filter table. Options
> should control size of the table and number of probes.
>
> Building up all these pieces and connecting them is a truly worthy task.
>
> On Thu, Jan 7, 2010 at 3:44 PM, zaki rahaman <[email protected]> wrote:
>
>> @Ted, where is the partial framework you're referring to? And yes this is
>> definitely something I would like to work on if pointed in the right
>> direction. I wasn't quite sure though just b/c I remember a long-winded
>> discussion/debate a while back on the listserv about what Mahout's purpose
>> should be. N-gram LLR for collocations seems like a very NLP type of thing
>> to have (obviously it could also be used in other applications as well but
>> by itself it's NLP to me) and from my understanding the "consensus" is that
>> Mahout should focus on scalable machine learning.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
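
For anyone following along, the first two steps Ted describes (counting n-grams with a threshold and per-size totals, then scoring with LLR) might look roughly like the single-machine sketch below. This is not the Mahout classifier code or anyone's released framework: the function names (count_ngrams, suppress_rare, llr, bigram_llr) are made up for illustration, tokenization is assumed to be done already (for example by an analyzer), n-gram sizes start at 1 rather than 0, and the contingency table is approximated from unigram counts and the bigram total.

```python
from collections import Counter
from math import log


def count_ngrams(tokens, max_n):
    """Count every n-gram of size 1..max_n and the total seen per size."""
    counts = Counter()
    totals = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
            totals[n] += 1
    return counts, totals


def suppress_rare(counts, min_count):
    """Drop n-grams whose count is below the threshold."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


def _x_log_x(x):
    return x * log(x) if x > 0 else 0.0


def _entropy(*ks):
    # Unnormalized entropy of a list of counts.
    return _x_log_x(sum(ks)) - sum(_x_log_x(k) for k in ks)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def bigram_llr(counts, totals, bigram):
    """Score a bigram (a, b) from word counts and the bigram total.

    Approximation: unigram counts stand in for the row/column marginals.
    """
    a, b = bigram
    k11 = counts[bigram]                 # a followed by b
    k12 = counts[(a,)] - k11             # a without b after it
    k21 = counts[(b,)] - k11             # b without a before it
    k22 = max(0, totals[2] - k11 - k12 - k21)  # neither; clamped for tiny inputs
    return llr(k11, k12, k21, k22)
```

On a toy input such as "new york is new york", bigram_llr ranks ("new", "york") above ("york", "is"), which is the kind of ordering the "interesting n-grams" step relies on. A MapReduce version would split count_ngrams into a mapper that emits n-grams and a reducer that sums them.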

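
The optional last step, a Bloom filter over the selected n-grams with configurable table size and number of probes, could be sketched as follows. Again this is only an illustration under assumptions: the BloomFilter class and its parameter names (num_bits, num_probes) are hypothetical, and deriving the probe positions from a SHA-256 digest by double hashing is one common choice, not something specified in this thread.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit table; membership tests may give false positives."""

    def __init__(self, num_bits, num_probes):
        self.num_bits = num_bits        # size of the table, in bits
        self.num_probes = num_probes    # bit positions set/checked per item
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all probe positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_probes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))
```

Usage would be along the lines of creating BloomFilter(1024, 3), calling add("new york") for each n-gram that passed the LLR cutoff, and then testing "new york" in the filter at lookup time. Anything added is always found; items never added are usually, but not always, rejected.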