Dear all,

When I use lmplz to build a 6-gram model on POS-tag-like data (small
vocabulary, no real words), lmplz sometimes fails, depending on the
dataset.

lmplz was built from the latest code in either the mosesdecoder or the
kenlm GitHub repo.

The command line looks like this:

somedir/lmplz -o 6 -S 50% --text foo.txt --arpa foo.lm

Here is the stderr. The failing dataset is 2894562 lines, 92M.

=== 1/5 Counting and sorting n-grams ===
Reading /home/gumble/[somedir]/zh-cn-nw-pos1.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 41634632 types 114
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:1368 2:241348592 3:452528608 4:724045824 5:1055900160 6:1448091648
/home/gumble/github/kenlm/lm/builder/adjust_counts.cc:59 in void lm::builder::{anonymous}::StatCollector::CalculateDiscounts(const lm::builder::DiscountConfig&) threw BadDiscountException because `discounts_[i].amount[j] < 0.0 || discounts_[i].amount[j] > j'.
ERROR: 1-gram discount out of range for adjusted count 3: -0.2
Aborted (core dumped)

I sampled a few lines with `awk 'BEGIN {srand()} !/^$/ { if (rand() <= .0001) print }'
zh-cn-nw-pos1.txt`. Depending on the sample, lmplz either fails with a
similar error or runs successfully.
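
Since srand() without an argument seeds from the time of day, each awk
run draws a different sample, so a failing draw is hard to get back. Below
is a minimal Python sketch of the same sampling with a fixed seed. The
function name sample_lines and the seed value are my own choices for
illustration; they are not part of lmplz or of the awk command above.

```python
import random


def sample_lines(lines, rate, seed):
    """Keep each non-empty line with probability `rate`, reproducibly.

    Mirrors the awk one-liner: empty lines are skipped, and one random
    draw is made per non-empty line. `seed` is a hypothetical parameter
    added here so that the same sample can be regenerated later.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    kept = []
    for line in lines:
        if line.rstrip("\n") == "":
            continue  # same effect as awk's !/^$/
        if rng.random() <= rate:
            kept.append(line)
    return kept


# Example call (file name taken from the log above, seed is arbitrary):
# with open("zh-cn-nw-pos1.txt", encoding="utf-8") as f:
#     sample = sample_lines(f, 0.0001, seed=12345)
```

With a fixed seed, the same input file should give the same sample on
every run, so a failing sample can be regenerated rather than kept around.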

The attached failing sample (sample-fail.txt.gz) produces the error
"ERROR: 1-gram discount out of range for adjusted count 2: -1.23077".
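
For reference, here is a sketch of how I understand the discounts to be
derived. It assumes lmplz uses the closed-form modified Kneser-Ney
estimates of Chen and Goodman, where n[k] is the number of n-grams whose
adjusted count is k. I have not checked this line by line against
adjust_counts.cc. The function names are mine, and the counts mentioned
after the code are invented, not taken from my data.

```python
def modified_kneser_ney_discounts(n):
    """Return {k: D_k} for k = 1, 2, 3 from counts-of-counts n[1..4].

    Assumed formulas (Chen and Goodman):
        Y   = n1 / (n1 + 2 * n2)
        D_k = k - (k + 1) * Y * n[k + 1] / n[k]
    """
    for k in (1, 2, 3, 4):
        if n.get(k, 0) <= 0:
            raise ValueError(f"no n-grams with adjusted count {k}")
    y = n[1] / (n[1] + 2 * n[2])
    return {k: k - (k + 1) * y * n[k + 1] / n[k] for k in (1, 2, 3)}


def out_of_range(discounts):
    """Return the discounts that violate 0 <= D_k <= k.

    This is the same condition as the one quoted in the exception message.
    """
    return {k: d for k, d in discounts.items() if d < 0.0 or d > k}
```

For example, the made-up counts {1: 10, 2: 10, 3: 10, 4: 24} give a
discount of about -0.2 for adjusted count 3, which out_of_range flags.
With only 114 unigram types, the counts-of-counts are tiny, so I suspect a
ratio like n[4] / n[3] can easily get large enough to push a discount
below zero.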

Thanks.

Attachment: sample-fail.txt.gz
Description: application/gzip

Attachment: sample-success.txt.gz
Description: application/gzip
