On 07/13/11 15:53, Philipp Koehn wrote:
> Hi,
>
> But you're asking for a third piece of information. If you query for
> "foo bar baz" and I can tell you that it will never extend to "* foo bar
> baz" for any word * (due to pruning or filtering), then you need only
> remember "foo bar" (or even less). The trie knows this because the
> pointers are equal, but it currently isn't telling you. Probing could
> tell you this if I used the otherwise-unused probability sign bit to
> encode it.
>
>
> The thinking here is: given a prefix of "A B C D" and a language model
> of order 5, we can ignore D if the n-gram "A B C" is unknown.
>
> Why? Because if "A B C" is unknown, then any "* A B C" will also be
> unknown, assuming sane low-count pruning.

SRILM's default pruning is insane under this categorization. This is why
we had users complaining about the trie a while back (until I fixed it to
work around SRI). I can, and do, return ngram_length 3 for "B C D E" and
ngram_length 5 for "A B C D E". That said, the code knows about this
situation and could return the value you're looking for. By contrast,
when scrolling right, ngram_length increases by at most one each time,
because "A B C D E" implies "A B C D".

> So, there will always be
> free back-off to the lower order n-gram.
>
> Knowing that there is no "* A B C D" in the language model may
> not be helpful, since different "* A B C" have different backoff costs.

But these backoff costs are purely a function of "* A B C", so you only
need to remember (in your left state) that the backoff should be charged
once * is known. That's smaller than remembering D and permits more
recombination.

This is slowly converging to KenLM taking responsibility for left state,
right state, and a merge operation that outputs rest cost and probability.

>
> -phi
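
To make the ngram_length point concrete, here is a minimal Python sketch. It is a toy and not KenLM code: the n-gram set, `ORDER`, and the helper names `ngram_length`, `extends_left`, and `scroll_right` are all invented for illustration. The set is assumed to be prefix-closed but not suffix-closed, which is the situation described above for SRILM's default pruning.

```python
# Toy model: a set of n-grams stored as tuples (hypothetical data, not a real LM).
# It is prefix-closed ("A B C D E" implies "A B C D") but not suffix-closed:
# "A B C D E" is present while "B C D E" is absent.
NGRAMS = {
    ("A",), ("B",), ("C",), ("D",), ("E",),
    ("A", "B"), ("C", "D"), ("D", "E"),
    ("A", "B", "C"), ("C", "D", "E"),
    ("A", "B", "C", "D"),
    ("A", "B", "C", "D", "E"),
}
ORDER = 5


def ngram_length(words):
    """Length of the longest suffix of `words` (at most ORDER) found in NGRAMS."""
    for n in range(min(ORDER, len(words)), 0, -1):
        if tuple(words[-n:]) in NGRAMS:
            return n
    return 0  # even the last word alone is unknown


def extends_left(words):
    """True if some n-gram "* words" exists for at least one word *."""
    target = tuple(words)
    return any(len(g) == len(target) + 1 and g[1:] == target for g in NGRAMS)


def scroll_right(sentence):
    """ngram_length after each word when scoring left to right."""
    return [ngram_length(sentence[: i + 1]) for i in range(len(sentence))]


if __name__ == "__main__":
    print(ngram_length("B C D E".split()))    # longest match is "C D E"
    print(ngram_length("A B C D E".split()))  # full 5-gram is present
    print(extends_left("B C D E".split()))    # unknown n-gram that still extends left
    print(scroll_right("B C D E".split()))
```

In this toy, "B C D E" is not in the set, yet `extends_left` reports that "* B C D E" exists (via "A B C D E"), so an unknown n-gram does not rule out a longer one to its left. Scrolling right, each step grows the match by at most one word because the set is prefix-closed.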
