In the spirit of Jake's message, would anyone be opposed to a commit
of MAHOUT-317? (https://issues.apache.org/jira/browse/MAHOUT-317)

It is a refactoring of the LLR collocation work that eliminates the
in-memory frequency calculations for ngram and (n-1)gram frequencies.
Using a secondary sort removes the need to drop the ngrams into a map
to collect frequencies for each unique ngram: with the proper
ordering, it is just a matter of accumulating counts over contiguous
runs of identical grams. This is essential for the scalability of the
ngram LLR calculations.
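
To illustrate the idea only, here is a minimal sketch in plain Java of
counting over sorted input without a map. It is not the code in the
patch and uses no Hadoop or Mahout classes; the class name
ContiguousCounts and the method countSorted are hypothetical, and the
input is assumed to arrive already sorted, as it would after the
shuffle with a secondary sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContiguousCounts {

    /**
     * Counts runs of equal adjacent grams. Assumes the input is already
     * sorted, so identical grams are contiguous and only one running
     * counter is held in memory at a time.
     */
    static List<String> countSorted(List<String> sortedGrams) {
        List<String> out = new ArrayList<>();
        String current = null;
        int count = 0;
        for (String gram : sortedGrams) {
            if (gram.equals(current)) {
                count++;                      // same run, keep accumulating
            } else {
                if (current != null) {
                    out.add(current + "\t" + count);  // run ended, emit it
                }
                current = gram;               // start a new run
                count = 1;
            }
        }
        if (current != null) {
            out.add(current + "\t" + count);  // emit the final run
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, already in sorted order.
        List<String> grams = Arrays.asList(
            "best of", "best of", "of times",
            "the best", "the best", "the best");
        for (String line : countSorted(grams)) {
            System.out.println(line);
        }
    }
}
```

Memory use here is independent of the number of distinct grams, which
is the property the map-based approach lacks.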

Unit tests are included and I've regression tested the patch against
the original implementation on the 20news corpus -- it produces the
same results.

So, with the group's blessing I will commit.

Drew
