Pardon my ignorance, as this is probably best handled by an NLP package like GATE or LingPipe, but does Mahout provide anything for finding collocations? Or does anyone know of a MapReducible way to calculate something like t-values for tokens in N-grams? I've got quite a large collection that I have to prune, filter, and preprocess, but I expect it to still be of significant size afterwards.
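To make concrete what I mean by t-values, here is a rough sketch for the bigram case only. It assumes the usual collocation t-score, t = (c12 - c1*c2/N) / sqrt(c12), where c12 is the bigram count, c1 and c2 are the unigram counts, and N is the total token count (used as an approximation for the number of bigram positions). The function names are my own and are not from Mahout or any other library; the map and reduce steps are plain Python functions standing in for real MapReduce jobs.

```python
import math
from collections import Counter
from functools import reduce


def map_counts(tokens):
    """Map step (hypothetical): emit unigram and bigram counts for one document."""
    counts = Counter()
    for tok in tokens:
        counts[("uni", tok)] += 1
    for left, right in zip(tokens, tokens[1:]):
        counts[("bi", left, right)] += 1
    return counts


def reduce_counts(a, b):
    """Reduce step (hypothetical): sum partial counts; associative and commutative."""
    return a + b


def t_scores(counts):
    """Compute the t-score for every bigram from the merged counts."""
    # N is the total number of tokens (assumed close to the number of bigrams).
    n = sum(c for key, c in counts.items() if key[0] == "uni")
    scores = {}
    for key, c12 in counts.items():
        if key[0] != "bi":
            continue
        _, w1, w2 = key
        # Expected bigram count if the two words were independent.
        expected = counts[("uni", w1)] * counts[("uni", w2)] / n
        scores[(w1, w2)] = (c12 - expected) / math.sqrt(c12)
    return scores


docs = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["the", "city"],
]
total = reduce(reduce_counts, map(map_counts, docs), Counter())
for bigram, t in sorted(t_scores(total).items(), key=lambda kv: -kv[1]):
    print(bigram, round(t, 3))
```

Since the counts are just sums, the counting part should map onto MapReduce directly; the part I am less sure about is joining each bigram with its two unigram counts efficiently at scale.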
-- Zaki Rahaman
