Hi, here is a description of the ARPA format used for language models: http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
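In short, as I read that page: each line in an \N-grams: section holds the log10 probability of the n-gram, then its N tokens, then an optional log10 backoff weight. The backoff weight is only needed when the n-gram is a prefix of a longer n-gram in the model, which is why it is sometimes missing (and always missing for the highest order). Below is a rough Python sketch of a reader for this layout, not SRILM's own code. The function name parse_arpa and the sample text are made up for illustration: the first two unigram lines are taken from the question, while the counts, the "#" line without a backoff weight and the 2-gram line are invented.

```python
import re

def parse_arpa(lines):
    """Read ARPA-format LM text into (counts, ngrams).

    counts maps n-gram order -> number of n-grams declared in the header.
    ngrams maps a tuple of tokens -> (log10 probability, log10 backoff or None).
    """
    counts = {}
    ngrams = {}
    order = None  # order of the section currently being read
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        # Header lines such as "ngram 1=76288" (only before the first section).
        m = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
        if m and order is None:
            counts[int(m.group(1))] = int(m.group(2))
            continue
        # Section markers such as "\1-grams:".
        m = re.fullmatch(r"\\(\d+)-grams:", line)
        if m:
            order = int(m.group(1))
            continue
        if order is None:
            continue
        # Entry: log10 probability, `order` tokens, optional log10 backoff weight.
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[words] = (logprob, backoff)
    return counts, ngrams

# Illustrative input only; see the note above about which lines are invented.
sample = r"""
\data\
ngram 1=3
ngram 2=1

\1-grams:
-2.815075 ! -1.648233
-3.10526 " -0.4596801
-6.09184 #

\2-grams:
-0.5 ! "

\end\
"""

counts, ngrams = parse_arpa(sample.splitlines())
logprob, backoff = ngrams[("!",)]
prob = 10 ** logprob  # convert the log10 value back to a plain probability
```

The values are base-10 logarithms, so 10 ** logprob recovers the probability itself. The sketch splits on whitespace and relies on the section order to tell tokens from a trailing backoff weight, so it assumes tokens contain no spaces.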

Michael Zuckerman wrote:
> Hi,
>
> Could you please explain about the format of .lm file generated by the
> script ngram-count. For example, I got .lm file that starts with:
>
> \data\
> ngram 1=76288
> ngram 2=1644644
> ngram 3=1410926
> ngram 4=1393383
> ngram 5=1071864
>
> \1-grams:
> -2.815075 ! -1.648233
> -3.10526 " -0.4596801
> -6.09184 # -0.1521228
> -4.628769 $ -0.2417951
> -3.474399 % -0.7403963
> -4.398747 & -0.7879647
> -2.462822 ' -0.6111439
>
> If I understand correctly "ngram 1=76288" means that there are 76288
> ngrams containing one token (word), and so on.
> But what do the negative numbers before and after the tokens mean ?
> Also I noticed that sometimes the numbers after the tokens are
> missing. What does it mean ?
>
> Thank you very much,
> Michael.
> ------------------------------------------------------------------------
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Alexandre Allauzen
Univ. Paris XI, LIMSI-CNRS
Tel : 01.69.85.80.64 (80.88)
Bur : 114 LIMSI Bat. 508
[EMAIL PROTECTED]