On 07/13/11 15:53, Philipp Koehn wrote:
> Hi,
> 
>     But you're asking for a third piece of information.  If you query for
>     "foo bar baz" and I can tell you that it will never extend to "* foo bar
>     baz" for any word * (due to pruning or filtering), then you need only
>     remember "foo bar" (or even less).  The trie knows this because the
>     pointers are equal, but it currently isn't telling you.  Probing could
>     tell you this if I used the otherwise-unused probability sign bit to
>     encode it.
> 
> 
> The thinking here is: given a prefix of "A B C D" and a language model
> of order 5, we can ignore D if the n-gram "A B C" is unknown.
> 
> Why? Because if "A B C" is unknown, then also any "* A B C" will be
> unknown, assuming sane low-count pruning. 

SRILM's default pruning is insane under this categorization.  This is
why we had users complaining about trie a while back (until I fixed it
to work around SRI).  I can, and do, return ngram_length 3 for "B C D E"
and ngram_length 5 for "A B C D E".  That said, the code knows about
this situation and could return the value you're looking for.
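
A toy sketch of this case in Python (not KenLM code; the n-gram set and
the matched_length helper are made up for illustration).  It assumes the
lookup checks every order, so a pruned 4-gram does not hide a 5-gram
that was kept:

```python
# Toy model: the set of n-grams that survived pruning (made-up data).
# "B C D E" was pruned even though the longer "A B C D E" was kept,
# which is the situation attributed to SRILM's default pruning above.
NGRAMS = {
    ("E",),
    ("D", "E"),
    ("C", "D", "E"),
    ("A", "B", "C", "D", "E"),
}


def matched_length(words, ngrams=NGRAMS, order=5):
    """Return the length of the longest suffix of `words` found in `ngrams`.

    Every order up to `order` is checked, so a missing 4-gram does not
    hide a 5-gram that is present.
    """
    words = tuple(words)
    best = 0
    for n in range(1, min(order, len(words)) + 1):
        if words[-n:] in ngrams:
            best = n
    return best


print(matched_length("B C D E".split()))    # 3
print(matched_length("A B C D E".split()))  # 5
```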

By contrast, when scrolling right, ngram_length increases by at most one
each time, because "A B C D E" implies "A B C D".
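
A small Python sketch of why that bound holds, assuming only that the
model is closed under prefixes (every stored n-gram has its prefixes
stored too).  The helper names and the data are hypothetical, not part
of KenLM:

```python
def close_under_prefixes(ngrams):
    """Add every prefix of every n-gram, e.g. "A B C D" for "A B C D E"."""
    closed = set()
    for gram in ngrams:
        for n in range(1, len(gram) + 1):
            closed.add(gram[:n])
    return closed


def matched_length(words, ngrams, order=5):
    """Length of the longest suffix of `words` present in `ngrams`."""
    words = tuple(words)
    best = 0
    for n in range(1, min(order, len(words)) + 1):
        if words[-n:] in ngrams:
            best = n
    return best


def scroll_right(sentence, ngrams, order=5):
    """Matched length after each word, reading left to right."""
    return [matched_length(sentence[:i], ngrams, order)
            for i in range(1, len(sentence) + 1)]


MODEL = close_under_prefixes({
    ("A", "B", "C", "D", "E"),
    ("C", "D", "E"),
    ("D", "E"),
    ("E",),
})

lengths = scroll_right("B C D E".split(), MODEL)
print(lengths)  # [0, 1, 2, 3]

# A match of length n ending at word i has a prefix of length n - 1
# ending at word i - 1, so the length can grow by at most one per step.
assert all(b <= a + 1 for a, b in zip(lengths, lengths[1:]))
```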

> So, there will always be
> free back-off to the lower order n-gram.

> 
> Knowing that there is no "* A B C D" in the language model may
> not be helpful, since different "* A B C" have different backoff costs.

But these backoff costs are purely a function of "* A B C" and so you
only need to remember (in your left state) that the backoff should be
charged once * is known.  That's smaller than remembering D and permits
more recombination.
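
One way to picture such a left state, as a hedged Python sketch.  The
LeftState class, the helpers, and the backoff table are all
hypothetical; it assumes the caller already knows that nothing longer
than the kept words can extend to the left, and it models a single
owed backoff charge:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeftState:
    words: tuple           # left-edge words that may still extend leftward
    pending_backoff: bool  # True if backoff("* " + words) is still owed


# Made-up log10 backoff weights, keyed by the full context "* A B C".
BACKOFF = {("Z", "A", "B", "C"): -0.3}


def make_left_state(fragment, keep):
    """Keep the first `keep` words; flag a charge if any were dropped."""
    fragment = tuple(fragment)
    return LeftState(fragment[:keep], pending_backoff=len(fragment) > keep)


def owed_backoff(left_word, state, backoff=BACKOFF):
    """Charge the deferred backoff once the word to the left is known."""
    if not state.pending_backoff:
        return 0.0
    # A context missing from the table has backoff weight 0.0 in log space.
    return backoff.get((left_word,) + state.words, 0.0)


s1 = make_left_state("A B C D".split(), keep=3)
s2 = make_left_state("A B C X".split(), keep=3)
print(s1 == s2)               # True: D and X are not part of the state
print(owed_backoff("Z", s1))  # -0.3
print(owed_backoff("Y", s1))  # 0.0
```

The two fragments differ only in their fourth word, yet they share a
left state, which is what allows them to recombine; the cost that
depends on * is looked up later from "* A B C" alone.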

This is slowly converging to KenLM taking responsibility for left state,
right state, and a merge operation that outputs rest cost and probability.

> 
> -phi
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
