Hello,
I am puzzled by the way SRI calculates the Kneser-Ney and
modified-Kneser-Ney probabilities. I would appreciate it if anyone who has
used these packages carefully, or developed them, could help me figure
this out. Please note that I have spent more than 50 hours on this and
keep getting a mix of expected and unexpected perplexity scores. I have
read the Chen and Goodman paper a few times and checked the SRILM FAQ,
the discount pages, and the code, and it is still not clear to me how the
probabilities are calculated. Following the computation steps SRILM
released on their web page (
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html)
doesn't always produce the same result as the package itself (sometimes
it does, and sometimes it doesn't). For simplicity, assume that the test
data contains no OOV words and that no pruning/refinement happens at
training time. What I am looking for is the math behind these
computations, not the hack. Below I write out one possible way of
calculating KN and m-KN. I would appreciate your comments wherever you
know that an assumption I made is inconsistent with SRI/KenLM.
*For Kneser-Ney:*
- SRI uses the actual counts, not the continuation counts, for the
highest-order ngrams and for any ngrams that start with <s>.
- For Kneser-Ney, SRI uses one single discount D = n_1 / (n_1 + 2*n_2),
where n_1 and n_2 are the numbers of ngrams of a given order that occur
exactly once and exactly twice. So, for example, in a 3-gram model D is
calculated from the counts of 3-grams of frequency 1 and 2. More
importantly, if I understood correctly, the same D (calculated from the
3-grams) is used even when backing off to 2-grams. The formulation then
becomes:
P(c|ab) = max{c(abc) - D, 0} / c(ab) + D * N1+(ab.) / c(ab) * P(c|b)
gamma(ab) = D * N1+(ab.)   (backoff mass under the single discount D)
D is calculated *based on 3-gram order*
P(c|b) = max{N1+(.bc) - D, 0} / N1+(.b.) + D * N1+(b.) / N1+(.b.) * P(c)
D is calculated *based on 3-gram order* *(same discount as the highest
order)*
P(c) = N1+(.c) / N1+(..)
Is this correct?
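
To make the Kneser-Ney assumptions above concrete, here is a minimal Python
sketch of those three formulas with one discount D taken from the 3-gram
counts. It is not SRILM or KenLM code. It only encodes the assumptions
stated above, it leaves out the special handling of ngrams starting with
<s>, it only works for contexts seen in training, and all names (build_kn,
p_tri, the toy corpus) are made up for illustration.

```python
from collections import Counter


def build_kn(tokens):
    """Interpolated KN for trigrams with one discount D, as assumed above."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))  # c(abc)
    bigram_types = set(zip(tokens, tokens[1:]))

    # Single discount from the counts-of-counts of the 3-grams.
    n1 = sum(1 for v in tri.values() if v == 1)
    n2 = sum(1 for v in tri.values() if v == 2)
    D = n1 / (n1 + 2 * n2)

    c_ab, n_ab_dot, n_dot_bc = Counter(), Counter(), Counter()
    for (a, b, c), cnt in tri.items():
        c_ab[a, b] += cnt       # c(ab), summed over the following word
        n_ab_dot[a, b] += 1     # N1+(ab.)
        n_dot_bc[b, c] += 1     # N1+(.bc)

    n_dot_b_dot, n_b_dot = Counter(), Counter()
    for (b, c), cnt in n_dot_bc.items():
        n_dot_b_dot[b] += cnt   # N1+(.b.)
        n_b_dot[b] += 1         # N1+(b.), words c with N1+(.bc) > 0

    n_dot_c = Counter(c for _, c in bigram_types)  # N1+(.c)
    n_dot_dot = len(bigram_types)                  # N1+(..)

    def p_uni(c):
        return n_dot_c[c] / n_dot_dot

    def p_bi(c, b):
        hi = max(n_dot_bc[b, c] - D, 0) / n_dot_b_dot[b]
        return hi + D * n_b_dot[b] / n_dot_b_dot[b] * p_uni(c)

    def p_tri(c, a, b):
        hi = max(tri[a, b, c] - D, 0) / c_ab[a, b]
        return hi + D * n_ab_dot[a, b] / c_ab[a, b] * p_bi(c, b)

    return D, p_tri


# Toy corpus (hypothetical), no sentence boundaries.
tokens = "the cat sat on the mat the cat ate the rat the cat sat".split()
D, p_tri = build_kn(tokens)
# Sanity check: P(. | the cat) should sum to 1 over the vocabulary.
total = sum(p_tri(w, "the", "cat") for w in set(tokens))
print(D, total)
```

If the sum over the vocabulary is not 1 for a seen context, the counts and
the backoff mass are not consistent with each other, which is one way to
check a hand computation before comparing it against the toolkit output.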
*For modified Kneser-Ney:*
I am making the following assumptions about the modified-KN implementation,
and I would appreciate corrections wherever they are wrong:
- As in SRI's Kneser-Ney above, the actual counts are used for the
highest-order ngrams and for those that start with <s>. For the lower
orders, the counts are continuation counts.
- The discounts are no longer tied together: each level of backoff has its
own discounts. Those discounts are calculated from the actual counts (for
the highest order, and for ngrams starting with <s>) or the continuation
counts (lower orders) of the ngrams at that level. So, for example, for the
3-gram case we can write the following:
P(c|ab) = max{c(abc)-D(c(abc)),0} / c(ab) + gamma(ab) / c(ab) * P(c|b)
gamma(ab) = N_1 (ab .) *D_1 + N_2 (ab .) * D_2 + N_+3 (ab .) * D_+3
D_1, D_2, D_+3 are calculated *based on 3-gram order*
P(c|b) = max{N1+(.bc) - D(N1+(.bc)), 0} / N1+(.b.) + gamma(b) / N1+(.b.) *
P(c)
gamma(b) = N_1 (b .) *D_1 + N_2 (b .) * D_2 + N_+3 (b .) * D_+3
D_1, D_2, D_+3 are calculated *based on 2-gram order*
P(c) = N1+(.c) / N1+(..)
Is this how SRI does it?
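
For the modified-KN case, the per-order discounts and gamma can be sketched
in Python as below. The discount formulas are the ones given by Chen and
Goodman (Y = n_1 / (n_1 + 2*n_2), D_1 = 1 - 2*Y*n_2/n_1,
D_2 = 2 - 3*Y*n_3/n_2, D_+3 = 3 - 4*Y*n_4/n_3). Whether SRI feeds them
actual or continuation counts at each order is exactly the assumption being
asked about, so the function just takes whatever count table it is given.
The function names and the toy count table are hypothetical.

```python
from collections import Counter


def mkn_discounts(ngram_counts):
    """Return (D_1, D_2, D_+3) from the counts-of-counts of one order."""
    coc = Counter(ngram_counts.values())
    n1, n2, n3, n4 = (coc[k] for k in (1, 2, 3, 4))
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1,
            2 - 3 * y * n3 / n2,
            3 - 4 * y * n4 / n3)


def discount_for(count, discounts):
    """D(count): 0 for an unseen ngram, else D_1, D_2 or D_+3."""
    return 0.0 if count == 0 else discounts[min(count, 3) - 1]


def gamma(followers, discounts):
    """N_1(ctx .)*D_1 + N_2(ctx .)*D_2 + N_+3(ctx .)*D_+3 for one context.

    `followers` maps each word seen after the context to its count.
    """
    return sum(discount_for(c, discounts) for c in followers.values())


# Hypothetical 3-gram count table with n_1..n_4 all non-zero.
counts = {
    ("a", "b", "c"): 4, ("a", "b", "d"): 3, ("a", "b", "e"): 2,
    ("a", "b", "f"): 1, ("b", "c", "a"): 3, ("b", "d", "a"): 2,
    ("c", "a", "b"): 1, ("d", "a", "b"): 1,
}
discounts = mkn_discounts(counts)
followers = {w: n for (a, b, w), n in counts.items() if (a, b) == ("a", "b")}
c_ab = sum(followers.values())
kept = sum(n - discount_for(n, discounts) for n in followers.values())
# The discounted mass plus gamma(ab) should give back c(ab).
print(discounts, gamma(followers, discounts), kept + gamma(followers, discounts), c_ab)
```

For the lower orders, the same functions would be called on a table of
continuation counts (or actual counts for ngrams starting with <s>) instead
of the raw 3-gram counts, under the assumptions listed above.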
Thanks,
Koorm
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support