kenlm now supports quantization. To use it, svn up then run build_binary with -q:
kenlm/build_binary -q 8 trie foo.arpa foo.out

for 8 bits. You can choose from 2 to 25 bits, inclusive. Currently, probability and backoff are quantized separately (in this case using 8 bits each). By default, -q applies to both probability and backoff; you can use -b to set the number of bits for backoff independently. As always, you can get a memory estimate by omitting the output file, e.g.

kenlm/build_binary -q 8 trie foo.arpa

There are 2^bits - 1 probability values (one is reserved for blanks when SRI prunes where it shouldn't) and 2^bits - 2 non-zero backoff values (the two reserved values indicate zero backoff for n-grams that do or do not extend to the right). Because these reserved values make the number of bins not a power of two, it's hard to support qARPA. IRSTLM doesn't optimize when a context is known not to extend, so it doesn't need two reserved backoff values.

Quantization currently uses a simple reimplementation of IRSTLM's binning method:

M. Federico and N. Bertoldi. 2006. How many bits are needed to store probabilities for phrase-based translation? In Proc. of the Workshop on Statistical Machine Translation, pages 94-101, New York City, June. Association for Computational Linguistics.

Plugging in other quantization methods should be relatively simple now. I haven't done a quality evaluation yet.

Quantization only works with trie. If you're quantizing, you're probably worried about memory anyway, so that's the data structure you want.

Kenneth

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
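For readers curious about the binning method itself, here is a minimal sketch of the idea behind Federico & Bertoldi (2006): sort the values, split them into bins of roughly equal population, and represent each bin by its mean. This is an illustrative toy, not kenlm's or IRSTLM's actual code, and it ignores the reserved codes described above (kenlm keeps 2^bits - 1 bins for probability and 2^bits - 2 for backoff).

```python
# Toy sketch of equal-population binning quantization
# (the idea behind Federico & Bertoldi 2006); not kenlm's actual code.

def bin_quantize(values, bits):
    """Quantize floats into up to 2**bits bins of roughly equal population.

    Returns (codebook, codes): codebook[b] is the mean of bin b, and
    codes[i] is the bin index assigned to values[i].
    """
    n_bins = 2 ** bits
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = max(1, len(values) // n_bins)
    codebook = []
    codes = [0] * len(values)
    # Split the sorted values into contiguous chunks; each chunk is one bin.
    for start in range(0, len(order), per_bin):
        chunk = order[start:start + per_bin]
        codebook.append(sum(values[i] for i in chunk) / len(chunk))
        for i in chunk:
            codes[i] = len(codebook) - 1
    return codebook, codes

# Example: quantize 8 log probabilities to 2 bits (4 bins).
probs = [-1.2, -0.5, -3.4, -0.4, -2.2, -0.9, -1.1, -0.3]
codebook, codes = bin_quantize(probs, bits=2)
```

Decoding is then just the table lookup codebook[codes[i]], which is why only `bits` bits per value need to be stored in the model file.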
