Hi,
Great question!
As described in Chen and Goodman, modified Kneser-Ney smoothing treats
<unk> as a count-0 unigram. All unigrams are interpolated with the
uniform distribution with weight backoff(empty string), so they get
backoff(empty string)/|vocabulary| mass just for being a word. That's
the only mass that <unk> gets. This is what KenLM does by default. As
a corollary, p(<unk>) is always smaller than p(word) for any seen word.
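
To make the default behaviour concrete, here is a minimal sketch with
made-up numbers. The function name and the toy values are illustrative
assumptions, not KenLM code; the only thing taken from the description
above is that every unigram, <unk> included, receives
backoff(empty string)/|vocabulary|.

```python
def interpolate_with_uniform(discounted, backoff_empty):
    """Add backoff(empty string)/|vocabulary| to every unigram, <unk> included.

    `discounted` maps each word to its discounted unigram probability, which
    already carries the implicit 1 - backoff(empty string) factor. <unk> has
    count 0, so its discounted probability is 0.
    """
    uniform_share = backoff_empty / len(discounted)
    return {word: p + uniform_share for word, p in discounted.items()}


# Hypothetical toy model: the discounted values sum to 1 - backoff_empty.
backoff_empty = 0.1
discounted = {"<unk>": 0.0, "the": 0.5, "cat": 0.3, "sat": 0.1}
probs = interpolate_with_uniform(discounted, backoff_empty)

for word, p in probs.items():
    print(f"p({word}) = {p:.6f}")
```

In this toy model <unk> ends up with only the uniform share, so it is
smaller than every seen word, and the distribution still sums to 1.
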
Footnote 7 on page 30 of my thesis mentions how SRILM does it:
http://kheafield.com/professional/thesis.pdf . Here it is in more gory
detail. I'm going from memory here because I currently work for a
for-profit.
1. First compute the probability of every word except <unk>, including
the aforementioned interpolation with unigrams.
2. Sum those probabilities and subtract from 1 to obtain p(<unk>). In
principle, this produces the same result as backoff(empty
string)/|vocabulary|. However, the sum is very close to 1 and p(<unk>)
is small, so this method is numerically imprecise.
3. SRILM checks if it calculated p(<unk>) > 3*10^-6 (which is the
hard-coded value of epsilon). If so, which is only the case for very
tiny language models (otherwise |vocabulary| is big enough), it returns
p(<unk>).
4. If it calculated p(<unk>) < 3*10^-6, as it usually is, then it does
what the comments describe as "another hack". This "disables" unigram
interpolation. Interpolation with uniform has two terms: backoff(empty
string)/|vocabulary| + discounted probability, where the discounted
probability implicitly includes the 1-backoff(empty string) term. SRILM
just never adds backoff(empty string)/|vocabulary| to each unigram, but
the discounted probabilities were still implicitly multiplied by
1-backoff(empty string) when they were discounted. In effect, compared
with Chen and Goodman, it steals backoff(empty string)/|vocabulary| from
every unigram.
5. SRILM again sums all the unigrams and takes 1 - their sum. Because
each of the |vocabulary| - 1 terms had backoff(empty
string)/|vocabulary| stolen from it, p(<unk>) is now higher by
(|vocabulary| - 1) * backoff(empty string) / |vocabulary|
and it already owned backoff(empty string)/|vocabulary| of the
probability space, so then it becomes
|vocabulary| * backoff(empty string)/|vocabulary|
= backoff(empty string). Therefore, SRILM's <unk> is larger than Chen
and Goodman say it should be, by a factor of |vocabulary|. This
explains the famous issue that p(<unk>) can be higher than the
probability of words in the vocabulary. With KenLM, you can emulate
this (IMHO broken) functionality by using --interpolate_unigrams 0.
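
The five steps above can be sketched roughly as follows. This is a toy
reconstruction of the procedure as recalled in this email, not SRILM
source code: the function name, the synthetic vocabularies and the way
discounted probabilities are supplied are all assumptions made for
illustration, and the epsilon value is the one quoted from memory above.

```python
EPSILON = 3e-6  # threshold as recalled in the email, not checked against SRILM


def srilm_style_unigrams(discounted, backoff_empty, epsilon=EPSILON):
    """Toy version of steps 1-5; `discounted` covers seen words only."""
    vocab_size = len(discounted) + 1  # seen words plus <unk>
    uniform_share = backoff_empty / vocab_size

    # Steps 1-2: interpolate every seen word, then p(<unk>) = 1 - sum.
    probs = {w: p + uniform_share for w, p in discounted.items()}
    p_unk = 1.0 - sum(probs.values())

    # Step 3: keep that value only if it exceeds epsilon.
    if p_unk <= epsilon:
        # Steps 4-5: drop the uniform term from every unigram and recompute,
        # so <unk> collects all of backoff(empty string).
        probs = dict(discounted)
        p_unk = 1.0 - sum(probs.values())

    probs["<unk>"] = p_unk
    return probs


backoff_empty = 0.1

# Tiny model: p(<unk>) stays above epsilon, so interpolation is kept.
small = srilm_style_unigrams({"the": 0.5, "cat": 0.3, "sat": 0.1}, backoff_empty)

# Larger synthetic model: 100,000 equally likely seen words.
n_words = 100_000
large_discounted = {f"w{i}": (1.0 - backoff_empty) / n_words for i in range(n_words)}
large = srilm_style_unigrams(large_discounted, backoff_empty)

print("small model p(<unk>):", small["<unk>"])
print("large model p(<unk>):", large["<unk>"])
print("large model p(any seen word):", large["w0"])
```

With the larger vocabulary, p(<unk>) comes out close to backoff(empty
string) and exceeds the probability of each seen word, which is the
effect described above.
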
Kenneth
On 02/05/2015 08:14 AM, koormoosh wrote:
> Hi,
>
> I am trying to figure out how unknown words are being handled in
> SRILM/KenLM. I've searched inside the /lm/src directory but the grep
> matches are not helpful. I am interested in LM and doing some
> experiments with my own implementation of Kneser-Ney, so knowing how
> unknown words are handled is important to get roughly equal results with
> SRILM or KenLM. Any comments? A pointer to a class is appreciated the most.
>
> * please note that I am not looking for a solution to handle unknown
> words, as I already have a solution for it. I want to know exactly how
> unknown words are being handled in SRILM.
>
> thank you
> -Koormoosh
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>