Re: [Moses-support] unknown words in SRILM/Kenlm

Kenneth Heafield Thu, 05 Feb 2015 06:13:14 -0800

Hi,

        Great question!

        As described in Chen and Goodman, modified Kneser-Ney smoothing treats
<unk> as a count-0 unigram.  All unigrams are interpolated with the
uniform distribution with weight backoff(empty string), so they get
backoff(empty string)/|vocabulary| mass just for being a word.  That's
the only mass that <unk> gets.  This is what KenLM does by default.  As
a corollary, p(<unk>) is always smaller than p(word) for any seen word.
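
To make the default behaviour concrete, here is a minimal Python sketch
of the unigram step.  It is not KenLM code: the function name is made
up, it uses a single discount instead of the three modified Kneser-Ney
discounts, and it assumes the adjusted (continuation) counts are already
given.  |vocabulary| counts <unk> as a word.

```python
def interpolated_unigrams(adjusted_counts, discount=0.75):
    """Unigram probabilities interpolated with the uniform distribution.

    adjusted_counts: dict mapping each seen word to its adjusted count.
    discount: one absolute discount (an assumption of this sketch;
              modified Kneser-Ney uses three).
    Returns (probabilities including "<unk>", backoff(empty string)).
    """
    counts = dict(adjusted_counts)
    counts["<unk>"] = 0                      # <unk> is a count-0 unigram
    total = sum(counts.values())
    vocab_size = len(counts)                 # |vocabulary|, <unk> included

    # Discounted probabilities; <unk> gets 0 because its count is 0.
    discounted = {w: max(c - discount, 0.0) / total
                  for w, c in counts.items()}

    # backoff(empty string) is the mass removed by discounting.
    backoff_empty = 1.0 - sum(discounted.values())

    # Every word, <unk> included, receives the same uniform share.
    uniform_share = backoff_empty / vocab_size
    probs = {w: d + uniform_share for w, d in discounted.items()}
    return probs, backoff_empty


probs, backoff_empty = interpolated_unigrams({"a": 3, "b": 2, "c": 1})
print(backoff_empty, probs["<unk>"], probs["c"])
```

With these toy counts backoff(empty string) is about 0.375 and p(<unk>)
is about 0.375/4 = 0.09375, which is below the probability of the rarest
seen word "c".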

        Footnote 7 on page 30 of my thesis mentions how SRILM does it:
http://kheafield.com/professional/thesis.pdf .  Here it is in more gory
detail.  I'm going from memory here because I currently work for a
for-profit.

1. First compute the probability of every word except <unk>, including
the aforementioned interpolation with unigrams.

2. Sum those probabilities and subtract from 1 to obtain p(<unk>).  In
principle, this produces the same result as backoff(empty
string)/|vocabulary|.  However, the sum is very close to 1 and p(<unk>)
is small, so this method is numerically imprecise.

3. SRILM checks if it calculated p(<unk>) > 3*10^-6 (which is the
hard-coded value of epsilon).  If so, which is only the case for very
tiny language models (otherwise |vocabulary| is big enough), it returns
p(<unk>).

4. If it calculated p(<unk>) < 3*10^-6, as it usually is, then it does
what the comments describe as "another hack".  This "disables" unigram
interpolation.  Interpolation with uniform has two terms: backoff(empty
string)/|vocabulary| + discounted probability, where the discounted
probability implicitly includes the 1-backoff(empty string) term.  It
just never adds backoff(empty string)/|vocabulary| to each unigram, but
the discounted probabilities were still implicitly multiplied by
1-backoff(empty string) when they were discounted.  In effect, compared
with Chen and Goodman, it steals backoff(empty string)/|vocabulary| from
every unigram.

5. SRILM again sums all the unigrams and takes 1 - their sum.  Because
each of the |vocabulary| - 1 terms had backoff(empty
string)/|vocabulary| stolen from it, p(<unk>) is now higher by

(|vocabulary| - 1) * backoff(empty string) / |vocabulary|

and it already owned backoff(empty string)/|vocabulary| of the
probability space, so then it becomes

|vocabulary| * backoff(empty string)/|vocabulary|

= backoff(empty string).  Therefore, SRILM's <unk> is larger than Chen
and Goodman say it should be, by a factor of |vocabulary|.  This
explains the famous issue that p(<unk>) can be higher than the
probability of words in the vocabulary.  With KenLM, you can emulate
this (IMHO broken) functionality by using --interpolate_unigrams 0.
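
The five steps above can be sketched in Python as follows.  This is an
illustration of the description in this message, not SRILM's actual
code: the function name is hypothetical, a single discount stands in for
the three modified Kneser-Ney discounts, and the adjusted counts are
assumed to be given.  The epsilon default is the 3*10^-6 quoted above.

```python
EPSILON = 3e-6  # threshold as quoted in the message (from memory)


def srilm_style_unigrams(adjusted_counts, discount=0.75, epsilon=EPSILON):
    """Emulate the described <unk> handling on seen-word adjusted counts.

    Returns (probabilities including "<unk>", backoff(empty string)).
    """
    total = sum(adjusted_counts.values())
    vocab_size = len(adjusted_counts) + 1    # |vocabulary|, +1 for <unk>

    discounted = {w: max(c - discount, 0.0) / total
                  for w, c in adjusted_counts.items()}
    backoff_empty = 1.0 - sum(discounted.values())
    uniform_share = backoff_empty / vocab_size

    # Step 1: every word except <unk>, interpolated with uniform.
    probs = {w: d + uniform_share for w, d in discounted.items()}
    # Step 2: p(<unk>) is whatever mass is left over.
    p_unk = 1.0 - sum(probs.values())

    # Steps 3-5: if the leftover is below epsilon, drop the uniform share
    # from every seen word and recompute the leftover.
    if p_unk < epsilon:
        probs = dict(discounted)
        p_unk = 1.0 - sum(probs.values())    # roughly backoff(empty string)

    probs["<unk>"] = p_unk
    return probs, backoff_empty


# 200,000 words with adjusted count 2: the uniform share falls below epsilon.
big = {f"w{i}": 2 for i in range(200000)}
probs, backoff_empty = srilm_style_unigrams(big)
print(backoff_empty, probs["<unk>"], probs["w0"])
```

In the large example p(<unk>) comes out at roughly backoff(empty string)
(about 0.375), far above the probability of any seen word.  With a tiny
vocabulary such as {"a": 3, "b": 2, "c": 1} the leftover mass is well
above epsilon, so step 3 returns the Chen and Goodman value unchanged.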

Kenneth

On 02/05/2015 08:14 AM, koormoosh wrote:
> Hi,
> 
> I am trying to figure out how unknown words are being handled in
> SRILM/KenLM. I've searched inside the /lm/src directory but the grep
> matches are not helpful. I am interested in LM and doing some
> experiments with my own implementation of Kneser-Ney, so knowing how
> unknown words are handled is important to get roughly equal results with
> SRILM or KenLM. Any comments? A pointer to a class is appreciated the most.
> 
> * please note that I am not looking for a solution to handle unknown
> words, as I already have a solution for it. I want to know exactly how
> unknown words are being handled in SRILM.
> 
> thank you
> -Koormoosh
> 
> 
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
> 
