Dear All, I'm trying to build a language model (LM) from a large dataset (> 11 M sentences). I have generated the model in ARPA format with IRSTLM, and now I'd like to binarize it using KenLM.
I called build_binary to estimate memory usage and got this:

    Memory estimate:
    type        MB
    probing  16129 assuming -p 1.5
    trie      7462 without quantization
    trie      4361 assuming -q 8 -b 8 quantization
    trie      6440 assuming -a 22 array pointer compression
    trie      3339 assuming -a 22 -q 8 -b 8 array pointer compression and quantization

Then I ran the binarization like this:

    /nfs/staging/turchmo/moses/kenlmNew/build_binary -i -t /tmp/ -q 8 -b 8 trie irstLM.ARPA.txt irstLanguageModel.binary.lm

but I got this error:

    lm/search_trie.cc:409 in void lm::ngram::trie::<unnamed>::SanityCheckCounts(const std::vector<long unsigned int, std::allocator<long unsigned int> >&, const std::vector<long unsigned int, std::allocator<long unsigned int> >&) threw util::Exception'.
    Longest count should be constant but it changed from 289546423 to 289546405
    Byte: 37297517525

I have looked through the mailing list, but I did not find any post with the same error. Any ideas?

Thanks a lot,
Marco
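
One possible way to narrow this down, as a minimal sketch only: compare the n-gram counts declared in the ARPA header with the number of entries actually present in each section. This assumes the error is related to a mismatch between the two, which the message suggests but does not confirm. The function name check_arpa_counts is hypothetical and not part of KenLM or IRSTLM; it relies only on the general ARPA layout (a \data\ header with "ngram N=count" lines, then \N-grams: sections, then \end\).

```python
import re

def check_arpa_counts(lines):
    """Return {order: (declared, found)} for an ARPA file given as an iterable of lines.

    declared is the count from the "ngram N=count" header line,
    found is the number of non-empty lines in the matching \\N-grams: section.
    """
    declared = {}
    found = {}
    order = None  # None until the first \N-grams: section starts
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Header lines such as "ngram 3=289546423" (only before the first section).
        if order is None:
            m = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if m:
                declared[int(m.group(1))] = int(m.group(2))
                continue
        # Section markers such as "\3-grams:".
        m = re.fullmatch(r"\\(\d+)-grams:", line)
        if m:
            order = int(m.group(1))
            found[order] = 0
            continue
        if line == "\\end\\":
            break
        if order is not None:
            found[order] += 1
    orders = sorted(set(declared) | set(found))
    return {n: (declared.get(n), found.get(n, 0)) for n in orders}
```

Since the file is read one line at a time, memory use stays small even for a file of this size. It could be called as check_arpa_counts(open("irstLM.ARPA.txt", encoding="utf-8", errors="replace")), where the encoding is an assumption; any order whose two numbers differ would be worth inspecting. Note that this only counts entries and does not detect duplicate n-grams.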
