We are happy to announce the release of version 1.23 of the Ngram
Statistics Package. This release focuses on the huge-count.pl
utilities, which are intended to count large amounts of text using
smaller amounts of memory than count.pl. While this works well in many
cases, we did notice some situations when running with version 1.23
that caused us concern. In particular there are situations where
huge-merge.pl can use an unexpectedly large amount of memory. While
the situation is improved in 1.23, it remains a concern.

The following is the "Bug" notice that we've put in huge-merge.pl and
CHANGES. We will continue to work on this, and welcome any
suggestions, etc. about how to handle these kinds of situations.

There is a limitation in huge-count.pl. When the corpus is very large
(>16G) and some of the terms in the bigrams are very long (>30 chars),
the program can run out of memory at the huge-merge.pl step. This is
because huge-merge.pl uses two hashes to count the frequencies of the
first and second terms of the bigrams. The memory used by these two
hashes grows with both the length of the terms and the number of
terms. For normal text, where terms are limited in length and number,
the software will not use up the memory.
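
To picture the two hashes mentioned in the notice, here is a rough
sketch. It is written in Python for illustration and is not the code
in huge-merge.pl; the function names, the record layout, and the toy
data are all assumptions made for this example. It keeps one table for
first terms and one for second terms, and shows that the space held by
the keys alone grows with how many terms there are and how long they
are.

```python
from collections import Counter

def marginal_counts(bigram_records):
    """Sum frequencies per first term and per second term.

    bigram_records is an iterable of (first_term, second_term, count)
    tuples. This layout is an assumption for the sketch, not the file
    format used by huge-count.pl.
    """
    first = Counter()   # one entry per first term seen
    second = Counter()  # one entry per second term seen
    for w1, w2, n in bigram_records:
        first[w1] += n
        second[w2] += n
    return first, second

def key_bytes(table):
    """Rough lower bound on memory taken by the keys (UTF-8 bytes only)."""
    return sum(len(key.encode("utf-8")) for key in table)

# Hypothetical toy input, not real corpus data.
records = [("the", "cat", 3), ("the", "dog", 2), ("a", "cat", 1)]
first, second = marginal_counts(records)
print(first["the"], second["cat"], key_bytes(first), key_bytes(second))
```

Both tables have to stay in memory until every record has been read,
so a corpus with many long terms makes them large no matter how the
bigram records themselves are split up.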

If you are using huge-count.pl you will likely want to install the new
version. If you are only using count.pl and statistic.pl, then for your
purposes this version is the same as the previous one.

You can find download links at http://ngram.sourceforge.net

The direct CPAN link is: http://search.cpan.org/~tpederse/Text-NSP-1.23/
It is also on SourceForge: https://sourceforge.net/projects/ngram/

Enjoy,
Ted and Ying


-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
