Hi,
Here is a description of the ARPA format used for language models:

http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
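
In short, as I read that page: each line in an "\N-grams:" section holds the log10 probability of the n-gram, then the n-gram itself, then an optional log10 backoff weight. The backoff weight is only needed for n-grams that are a prefix of a longer n-gram in the model, which is why it is sometimes missing. Below is a minimal Python sketch of how one such line could be read. The function name parse_ngram_line is hypothetical, and the sketch assumes whitespace-separated fields and that the n-gram order is already known from the section header; it is not a full ARPA reader.

```python
def parse_ngram_line(line, order):
    """Split one n-gram line into (log10 prob, words, log10 backoff or None).

    Hypothetical helper for illustration; `order` is the N taken from the
    enclosing "\\N-grams:" section header.
    """
    fields = line.split()
    # A line has the probability, `order` words, and optionally a backoff weight.
    if len(fields) not in (order + 1, order + 2):
        raise ValueError("unexpected number of fields: %r" % line)
    logprob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # The trailing backoff weight is absent when no longer n-gram extends this one.
    backoff = float(fields[1 + order]) if len(fields) == order + 2 else None
    return logprob, words, backoff


# First unigram line from the example in the question.
logprob, words, backoff = parse_ngram_line("-2.815075       !       -1.648233", 1)
prob = 10 ** logprob  # convert the log10 value back to a probability (about 0.0015)
```

A line without the last number simply yields None for the backoff weight in this sketch, which is commonly treated as a log10 weight of 0.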

Michael Zuckerman wrote:
> Hi,
>
> Could you please explain the format of the .lm file generated by the 
> ngram-count script? For example, I got an .lm file that starts with:
>
> \data\
> ngram 1=76288
> ngram 2=1644644
> ngram 3=1410926
> ngram 4=1393383
> ngram 5=1071864
>
> \1-grams:
> -2.815075       !       -1.648233
> -3.10526        "       -0.4596801
> -6.09184        #       -0.1521228
> -4.628769       $       -0.2417951
> -3.474399       %       -0.7403963
> -4.398747       &       -0.7879647
> -2.462822       '       -0.6111439
>
> If I understand correctly, "ngram 1=76288" means that there are 76288 
> n-grams containing one token (word), and so on.
> But what do the negative numbers before and after the tokens mean? 
> I also noticed that the numbers after the tokens are sometimes 
> missing. What does that mean?
>
> Thank you very much,
>      Michael.
> ------------------------------------------------------------------------
>
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support
>   


-- 
     Alexandre Allauzen
 Univ. Paris XI, LIMSI-CNRS
Tel : 01.69.85.80.64 (80.88)
Bur : 114     LIMSI Bat. 508
     [EMAIL PROTECTED]
