Dear all,

When using lmplz to build a 6-gram model for POS-tag-like data (small vocabulary, no real words), lmplz sometimes fails, depending on the dataset.
lmplz was built from the latest code in either the mosesdecoder or the
kenlm GitHub repo.
The command line looks like this:
somedir/lmplz -o 6 -S 50% --text foo.txt --arpa foo.lm
Here is the stderr output. The failing dataset has 2894562 lines (92M).
=== 1/5 Counting and sorting n-grams ===
Reading /home/gumble/[somedir]/zh-cn-nw-pos1.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 41634632 types 114
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1368 2:241348592 3:452528608 4:724045824 5:1055900160 6:1448091648
/home/gumble/github/kenlm/lm/builder/adjust_counts.cc:59 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 3: -0.2
Aborted (core dumped)
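For context, here is a minimal sketch of how a negative discount like the one above can arise. It assumes lmplz estimates modified Kneser-Ney discounts with the closed-form formulas of Chen and Goodman, Y = n1 / (n1 + 2*n2) and D_j = j - (j+1) * Y * n_(j+1) / n_j, where n_j is the number of n-grams with adjusted count j. The function names and the counts below are hypothetical and chosen only for illustration; they are not the real counts from this dataset.

```python
def kn_discounts(n):
    """Closed-form modified Kneser-Ney discounts for adjusted counts 1..3.

    n maps an adjusted count j (1..4) to the number of n-grams having
    that adjusted count (the "counts of counts").
    """
    y = n[1] / (n[1] + 2 * n[2])
    return {j: j - (j + 1) * y * n[j + 1] / n[j] for j in (1, 2, 3)}


def out_of_range(discounts):
    """Return the discounts that violate 0 <= D_j <= j."""
    return {j: d for j, d in discounts.items() if d < 0.0 or d > j}


# Hypothetical counts of counts for unigrams (NOT taken from the real data).
# With only ~100 types, n_4 can easily exceed n_3, which drives D_3 negative.
hypothetical_counts = {1: 10, 2: 5, 3: 5, 4: 8}

discounts = kn_discounts(hypothetical_counts)
print("discounts:", discounts)
print("out of range:", out_of_range(discounts))
```

With these made-up counts, D_1 and D_2 stay in range while D_3 comes out at about -0.2, because n_4 / n_3 is large. With a vocabulary of only 114 types, the unigram counts of counts are small and irregular, so ratios like this seem plausible.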
I used `awk 'BEGIN {srand()} !/^$/ { if (rand() <= .0001) print }' zh-cn-nw-pos1.txt`
to sample a few lines. Depending on the random sample, lmplz either
produces a similar error or runs successfully.
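Because `srand()` without an argument seeds awk from the current time, each run draws a different sample, which makes a failing sample hard to reproduce. A minimal sketch of the same sampling with a fixed seed follows; the function name `sample_lines`, the seed, and the in-memory demo data are hypothetical and only for illustration.

```python
import random


def sample_lines(lines, rate, seed):
    """Yield each non-empty line with probability `rate`, reproducibly."""
    rng = random.Random(seed)  # private generator, fixed seed
    for line in lines:
        # Skip empty lines, like the awk pattern !/^$/ does.
        if line.rstrip("\n") and rng.random() <= rate:
            yield line


# Small in-memory demo; a real run would iterate over the open corpus file.
demo = ["NN VV NN\n", "\n", "AD VV\n", "NN\n", "\n", "VV NN PU\n"]
first = list(sample_lines(demo, 0.5, seed=42))
second = list(sample_lines(demo, 0.5, seed=42))
print(first == second)  # same seed, same sample
```

For the real corpus, the same function can be applied to the open file object with rate 0.0001, and the seed can be recorded next to any sample that makes lmplz fail.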
The attached failing sample (sample-fail.txt.gz) produces the error
"ERROR: 1-gram discount out of range for adjusted count 2: -1.23077".
Thanks.
sample-fail.txt.gz
Description: application/gzip
sample-success.txt.gz
Description: application/gzip
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
