In the spirit of Jake's message, would anyone be opposed to a commit of MAHOUT-317? (https://issues.apache.org/jira/browse/MAHOUT-317)
It is a refactoring of the LLR collocation work that eliminates the in-memory frequency calculations for ngram and (n-1)-gram frequencies. Using a secondary sort removes the need to drop the ngrams into a map to collect frequencies for each unique ngram: with the proper ordering, it is just a matter of accumulating counts over contiguous runs of the same gram. This is essential for the scalability of the ngram LLR calculations.

Unit tests are included, and I've regression tested the patch against the original implementation on the 20news corpus; it produces the same results.

So, with the group's blessing, I will commit.

Drew
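
P.S. For anyone who has not read the patch, the idea of accumulating counts over contiguous grams can be sketched roughly as below. This is an illustration only, not the code in MAHOUT-317: the names ContiguousCount, GramCount and countSorted are hypothetical, and the sketch walks an already-sorted in-memory list where the real job would presumably rely on the ordering produced by the MapReduce sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContiguousCount {

    /** Hypothetical holder for one gram and its total frequency. */
    public static final class GramCount {
        public final String gram;
        public final int count;

        GramCount(String gram, int count) {
            this.gram = gram;
            this.count = count;
        }

        @Override
        public String toString() {
            return gram + "=" + count;
        }
    }

    /**
     * Counts grams in a single pass, assuming the input is sorted so that
     * equal grams are adjacent. Only the current gram and its running count
     * are held; no map of all unique grams is needed.
     */
    public static List<GramCount> countSorted(List<String> sortedGrams) {
        List<GramCount> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String gram : sortedGrams) {
            if (gram.equals(current)) {
                count++; // still inside the same contiguous run
            } else {
                if (current != null) {
                    out.add(new GramCount(current, count)); // run ended, emit it
                }
                current = gram;
                count = 1;
            }
        }
        if (current != null) {
            out.add(new GramCount(current, count)); // emit the final run
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> grams = Arrays.asList(
                "best of", "best of", "of times", "the best", "the best", "the best");
        System.out.println(countSorted(grams));
        // prints [best of=2, of times=1, the best=3]
    }
}
```

The point of the sketch is that memory use is constant in the number of unique grams, at the cost of requiring the sorted order up front.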