Dear All,
I'm trying to build a language model from a large dataset (more than 11
million sentences). I have generated the ARPA file with IRSTLM and would now
like to binarize it with KenLM.

I ran build_binary to estimate memory usage and got this:

Memory estimate:
type       MB
probing 16129 assuming -p 1.5
trie     7462 without quantization
trie     4361 assuming -q 8 -b 8 quantization
trie     6440 assuming -a 22 array pointer compression
trie     3339 assuming -a 22 -q 8 -b 8 array pointer compression and quantization

Then I ran the binarization like this:

/nfs/staging/turchmo/moses/kenlmNew/build_binary -i -t /tmp/ -q 8 -b 8 trie irstLM.ARPA.txt irstLanguageModel.binary.lm

but I got this error:

lm/search_trie.cc:409 in void lm::ngram::trie::<unnamed>::SanityCheckCounts(const std::vector<long unsigned int, std::allocator<long unsigned int> >&, const std::vector<long unsigned int, std::allocator<long unsigned int> >&) threw util::Exception'.
Longest count should be constant but it changed from 289546423 to 289546405
Byte: 37297517525
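
Since the message concerns n-gram counts, one possible sanity check is to
compare the counts declared in the ARPA header with the number of entries
actually present in each section. The following is only a minimal sketch: the
function names arpa_counts and mismatches are hypothetical, this is not
something build_binary provides, and it is an assumption that a header
mismatch has anything to do with this particular error.

```python
import re
from collections import defaultdict


def arpa_counts(lines):
    """Return (declared, actual) n-gram counts per order from ARPA lines.

    `lines` is any iterable of strings, e.g. an open file handle, so a
    very large file is streamed rather than loaded into memory.
    """
    declared = {}               # counts stated in the "\data\" header
    actual = defaultdict(int)   # entries really found in each section
    order = None                # order of the section being read, if any
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "\\end\\":
            break
        header = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        section = re.fullmatch(r"\\(\d+)-grams:", line)
        if header and order is None:
            declared[int(header.group(1))] = int(header.group(2))
        elif section:
            order = int(section.group(1))
        elif order is not None:
            actual[order] += 1
    return declared, dict(actual)


def mismatches(declared, actual):
    """Map each order to (declared, actual) where the two counts differ."""
    return {
        n: (declared[n], actual.get(n, 0))
        for n in sorted(declared)
        if declared[n] != actual.get(n, 0)
    }


# Tiny in-memory example: the header claims two bigrams but only one exists.
sample = [
    "\\data\\",
    "ngram 1=2",
    "ngram 2=2",
    "",
    "\\1-grams:",
    "-1.0\ta\t-0.5",
    "-1.0\tb\t-0.5",
    "",
    "\\2-grams:",
    "-0.3\ta b",
    "",
    "\\end\\",
]
declared, actual = arpa_counts(sample)
print(mismatches(declared, actual))
```

For the real model, the same function could be given an open handle on
irstLM.ARPA.txt instead of the sample list. On a file of this size a pure
Python pass would be slow, and it only counts lines; it does not detect
duplicate or malformed entries.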

I have looked through the mailing list archives, but I could not find any
post with the same error.

Any ideas?

Thanks a lot
Marco
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
