We are happy to announce the release of version 1.23 of the Ngram Statistics Package. This release focuses on the huge-count.pl utilities, which are intended to count large amounts of text using smaller amounts of memory than count.pl. While this works well in many cases, we did notice some situations when running with version 1.23 that caused us concern. In particular, there are situations where huge-merge.pl can use an unexpectedly large amount of memory. While the situation is improved in 1.23, it remains a concern.
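As a rough illustration of this kind of memory growth, here is a minimal sketch in Python. It is not the Perl code in huge-merge.pl. It assumes, as the bug notice describes, that one in-memory table is kept for first-term counts and one for second-term counts. The names `marginal_counts` and `approx_key_bytes` are hypothetical and chosen only for this example.

```python
import sys
from collections import Counter


def marginal_counts(bigrams):
    """Sum bigram frequencies per first term and per second term.

    `bigrams` is an iterable of (first_term, second_term, frequency).
    Both tables live entirely in memory, so they grow with the number
    of distinct terms seen in each position.
    """
    first = Counter()
    second = Counter()
    for w1, w2, freq in bigrams:
        first[w1] += freq
        second[w2] += freq
    return first, second


def approx_key_bytes(table):
    """Rough lower bound on memory: size of the key strings only."""
    return sum(sys.getsizeof(key) for key in table)


# Toy data (made up for illustration, not taken from any real corpus).
bigrams = [("new", "york", 3), ("new", "jersey", 2), ("york", "city", 1)]
first, second = marginal_counts(bigrams)

# Longer terms cost more per entry, even with the same number of entries.
short_terms = Counter({"a" * 5: 1})
long_terms = Counter({"a" * 40: 1})
```

In this sketch each table holds one entry per distinct term, and each entry costs at least the size of its key, so many distinct long terms make both tables large even when the bigram input itself is processed as a stream.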
The following is the "Bug" notice that we've put in huge-merge.pl and CHANGES. We will continue to work on this, and welcome any suggestions about how to handle these kinds of situations.

There is a limitation in huge-count.pl. When the corpus is very large (>16G) and some of the terms in the bigrams are very long (>30 chars), the program could run out of memory at the huge-merge.pl step. This is because huge-merge.pl uses two hashes to count the frequencies of the first and second terms of the bigrams. These two hashes could use up the memory as the length of the terms and the number of terms increase. For normal text, where terms are limited in length and number, the software won't use up the memory.

If you are using huge-count.pl you will likely want to install the new version. If you are only using count.pl and statistic.pl, then this version remains the same as the previous one.

You can find download links at http://ngram.sourceforge.net

The direct CPAN link is: http://search.cpan.org/~tpederse/Text-NSP-1.23/

And also on sourceforge: https://sourceforge.net/projects/ngram/

Enjoy,
Ted and Ying

--
Ted Pedersen
http://www.d.umn.edu/~tpederse