We do have a partial framework for this, including the log-likelihood ratio test computation.
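To make the log-likelihood ratio test concrete, here is a minimal single-machine sketch in Python. It is not Mahout's code: the function names `llr` and `score_bigrams` are made up for illustration, and the in-memory counters only stand in for counts that would come out of separate counting passes on a large collection. It uses the standard G^2 statistic over a 2x2 contingency table for each bigram.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 table of counts.

    k11: A followed by B        k12: A followed by not-B
    k21: not-A followed by B    k22: not-A followed by not-B
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k*log k is 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * total


def score_bigrams(tokens):
    """Score every adjacent token pair in a token list (hypothetical helper)."""
    first = Counter(tokens[:-1])             # word counts in first position
    second = Counter(tokens[1:])             # word counts in second position
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens) - 1                      # total number of bigrams
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11                 # a followed by something else
        k21 = second[b] - k11                # something else followed by b
        k22 = n - k11 - k12 - k21            # neither a first nor b second
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores
```

Calling `score_bigrams(text.split())` returns a dict keyed by `(first, second)` pairs; higher scores mean the pair co-occurs more often than the individual word counts would predict.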
For the most part, we don't have anything that specifically counts bigrams and words and arranges the counts in the right order for application, but that is relatively easy to write for map-reduce. I would be happy to provide pointers on the tricks I have seen to make that easy to do if you wanted to actually type the semi-colons and such.

On Tue, Jan 5, 2010 at 9:02 AM, zaki rahaman <[email protected]> wrote:

> Pardon my ignorance as this is probably best handled by an NLP package like
> GATE or LingPipe, but does Mahout provide anything for collocations? Or does
> anyone know of a MapReducible way to calculate something like t-values for
> tokens in N-grams? I've got quite a large collection that I have to prune,
> filter, and preprocess, but I still expect it to be a significant size.
>
> --
> Zaki Rahaman

--
Ted Dunning, CTO
DeepDyve
