Hi Kenneth,

Just to tell you that after training with SRILM using -unk and adding
the following code to my SRILM load function

    _sri_ngramLM->skipOOVs() = false;

I get the same score with SRILM and kenlm. Unfortunately this is not the
case for IRSTLM. I'll look at my code because I think that there might
be something wrong.

Thanks again for your help.

Regards
--
Felipe

On 29/10/10 16:09, Kenneth Heafield wrote:
> kenlm's query tool implicitly places <s> at the beginning. It doesn't
> appear in the output, but you can see the effect because the n-gram
> length after "the" is 2, not 1.
>
> The difference between the kenlm result and SRILM is the unknown word
> "74th". -55.599 + 1.13665 = -54.46235. The term -1.13665 appears to be
> the LM's backoff weight for the unigram "and". I think including the
> backoff is the right thing to do here and it's how Moses configures
> SRILM to operate (so you may want to look at LanguageModelSRI.cpp and
> copy how it initializes SRI).
>
> As to IRST, I hope they find the n-gram lengths and probabilities after
> each word useful in explaining that difference.
>
> Kenneth
>
> On 10/29/10 08:55, Felipe Sánchez Martínez wrote:
>> Hi Kenneth,
>>
>> The output of kenlm/query is:
>>
>> Loading the LM will be faster if you build a binary file.
>> Reading english.5gram.lm
>> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
>> ****************************************************************************************
>> Language model is missing <unk>. Substituting probability 0.
>> ************
>>
>> Loading statistics:
>> user 18.0001
>> sys 0.00047
>> rss 316632 kB
>> the 2 -0.894835 fifth 3 -3.34651 committee 2 -3.04771 resumed 1 -5.3955
>> its 2 -1.99768 consideration 2 -3.4901 of 3 -0.281781 the 4 -0.240104
>> item 3 -4.40691 at 2 -2.55249 its 2 -2.06475 64th 1 -7.43317 and 1
>> -2.20519 74th 0 -1.13665 meetings 1 -3.82205 , 2 -1.05335 on 3 -2.12476
>> 15 3 -2.54839 may 4 -1.06142 and 4 -1.42049 2 3 -2.24962 june 4
>> -0.381742 2000 2 -1.75696 .
>> 3 -0.68658 </s> 4 -0.000255845 Total: -55.599
>> After queries:
>> user 18.0001
>> sys 0.00047
>> rss 316656 kB
>> Total time including destruction:
>> user 18.0001
>> sys 0.00051
>> rss 1312 kB
>>
>> It seems that it is adding the end-of-sentence token, but not the
>> begin-of-sentence token.
>>
>> The score (-55.599) is different from SRILM (-54.4623) and from IRSTLM
>> (-49.9141 or -55.3099 when adding <s> and </s>).
>>
>> Thanks for your help
>> --
>> Felipe
>>
>> On 28/10/10 18:57, Kenneth Heafield wrote:
>>> Hi Felipe,
>>>
>>> Please run $recent_moses_build/kenlm/query langmodel.lm <text and post
>>> the output (you didn't need the statistics, just the line containing
>>> "Total:"). That will tell you the score and n-gram length at each word.
>>>
>>> Kenneth
>>>
>>> On 10/28/10 12:42, Felipe Sánchez Martínez wrote:
>>>> Hello all,
>>>>
>>>> My question is about SRILM and IRSTLM. It is not directly related to
>>>> Moses, but I did not know where to ask.
>>>>
>>>> I am scoring individual sentences with a 5-gram language model and I get
>>>> different scores with SRILM and IRSTLM.
>>>>
>>>> The language model was trained with SRILM through the following command
>>>> line:
>>>>
>>>> $ srilm/bin/i686-m64/ngram-count -order $(LM_ORDER) -interpolate
>>>> -kndiscount -text text.txt -lm langmodel.lm
>>>>
>>>> I do not know why I get different scores when scoring the same sentence.
>>>> In this regard I have a few questions:
>>>> * Does SRILM introduce begin-of-sentence and end-of-sentence tokens
>>>> during training?
>>>> * And during scoring (or decoding)?
>>>> * Does IRSTLM introduce begin-of-sentence and end-of-sentence tokens
>>>> during scoring (or decoding)?
>>>> * I know SRILM uses log base 10. Does IRSTLM also use log base 10? (It
>>>> seems so)
>>>>
>>>> When I score the English sentence "the fifth committee resumed its
>>>> consideration of the item at its 64th and 74th meetings , on 15 may and
>>>> 2 june 2000 ."
>>>> the scores (log prob) I get are:
>>>> SRILM: -54.4623
>>>> IRSTLM: -49.9141
>>>>
>>>> If I introduce <s> and </s> when scoring with IRSTLM I get a log prob of
>>>> -55.3099 (very similar to that of SRILM).
>>>>
>>>> The code to score with IRSTLM was borrowed from Moses.
>>>>
>>>> Thank you very much for your help.
>>>>
>>>> Regards.
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support

--
Felipe Sánchez Martínez
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2966
Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez
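
For anyone who wants to check the arithmetic in this thread (-55.599 + 1.13665 = -54.46235), here is a minimal Python sketch that sums the per-word log10 probabilities from the kenlm query output quoted above. It assumes the output is a flat sequence of "word n-gram-length log-prob" triples, and it treats an n-gram length of 0 as the marker of an unknown word. Both are read off the quoted output rather than taken from kenlm's documentation. The names KENLM_OUTPUT and parse_triples are made up for this illustration.

```python
# Per-word output copied from the kenlm query run quoted in this thread.
# Assumed layout: repeated "word ngram_length log10_prob" triples.
KENLM_OUTPUT = """
the 2 -0.894835 fifth 3 -3.34651 committee 2 -3.04771 resumed 1 -5.3955
its 2 -1.99768 consideration 2 -3.4901 of 3 -0.281781 the 4 -0.240104
item 3 -4.40691 at 2 -2.55249 its 2 -2.06475 64th 1 -7.43317 and 1
-2.20519 74th 0 -1.13665 meetings 1 -3.82205 , 2 -1.05335 on 3 -2.12476
15 3 -2.54839 may 4 -1.06142 and 4 -1.42049 2 3 -2.24962 june 4
-0.381742 2000 2 -1.75696 . 3 -0.68658 </s> 4 -0.000255845
"""


def parse_triples(text):
    """Split whitespace-separated output into (word, ngram_length, log10_prob)."""
    fields = text.split()
    if len(fields) % 3 != 0:
        raise ValueError("expected word/length/score triples")
    return [
        (fields[i], int(fields[i + 1]), float(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


triples = parse_triples(KENLM_OUTPUT)

# Total as kenlm reports it: every word counts, including the unknown one.
total_with_oov = sum(score for _, _, score in triples)

# Assumption: an n-gram length of 0 marks a word missing from the model.
oov_words = [word for word, length, _ in triples if length == 0]

# Leaving out the unknown-word terms imitates a scorer that skips OOVs.
total_skipping_oov = sum(score for _, length, score in triples if length > 0)

print(f"with OOV:     {total_with_oov:.4f}")
print(f"OOV words:    {oov_words}")
print(f"skipping OOV: {total_skipping_oov:.4f}")
```

With the numbers quoted above, the first total should come out at about -55.599 (the kenlm "Total:") and the second at about -54.4623 (the original SRILM score), with "74th" as the only unknown word. This only reproduces the sums; it does not say how SRILM or IRSTLM handle unknown words internally.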
