Here is a description of the ARPA format used for language models:
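
In short, each line under an "\N-grams:" header holds a log10 probability, then the N words of the n-gram, then an optional log10 backoff weight. The backoff weight is applied when a longer n-gram starting with these words is not in the model and the estimate falls back to a shorter context. It is omitted for n-grams that are never extended (always for the highest order), and a missing weight counts as 0, i.e. log10(1). As a rough sketch only, and not the code used by SRILM or Moses, a reader for this layout could look like the following. The name parse_arpa is hypothetical, the fields are assumed to be whitespace-separated, and the bigram line in the sample is made up for illustration.

```python
def parse_arpa(lines):
    """Parse ARPA-format lines into declared counts and n-gram entries."""
    counts = {}   # order -> number of n-grams declared in the \data\ header
    ngrams = {}   # tuple of words -> (log10 probability, log10 backoff weight)
    order = 0     # 0 while in the header, N inside an "\N-grams:" section
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if line.startswith("ngram "):
            # Header line such as "ngram 1=76288"
            n, count = line[len("ngram "):].split("=")
            counts[int(n)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\1-grams:"
            order = int(line[1:-len("-grams:")])
        elif order > 0:
            fields = line.split()
            logprob = float(fields[0])
            words = tuple(fields[1:1 + order])
            # The backoff weight is optional; treat a missing one as 0.0,
            # i.e. log10(1), so backing off applies no penalty.
            if len(fields) > 1 + order:
                backoff = float(fields[1 + order])
            else:
                backoff = 0.0
            ngrams[words] = (logprob, backoff)
    return counts, ngrams


# Tiny illustrative model: the two unigram lines come from the question,
# the bigram line is invented to show an entry without a backoff weight.
sample = [
    "\\data\\",
    "ngram 1=2",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-2.815075\t!\t-1.648233",
    '-3.10526\t"\t-0.4596801',
    "",
    "\\2-grams:",
    '-0.5\t! "',
    "",
    "\\end\\",
]

counts, ngrams = parse_arpa(sample)
logprob, backoff = ngrams[("!",)]
print(10 ** logprob)  # the plain (non-log) unigram probability of "!"
```

Since the values are base-10 logs, 10 ** logprob recovers the probability itself; for the "!" entry above that is roughly 0.0015.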


Michael Zuckerman wrote:
> Hi,
> Could you please explain the format of the .lm file generated by the
> ngram-count script? For example, I got a .lm file that starts with:
> \data\
> ngram 1=76288
> ngram 2=1644644
> ngram 3=1410926
> ngram 4=1393383
> ngram 5=1071864
> \1-grams:
> -2.815075       !       -1.648233
> -3.10526        "       -0.4596801
> -6.09184        #       -0.1521228
> -4.628769       $       -0.2417951
> -3.474399       %       -0.7403963
> -4.398747       &       -0.7879647
> -2.462822       '       -0.6111439
> If I understand correctly "ngram 1=76288" means that there are 76288 
> ngrams containing one token (word), and so on.
> But what do the negative numbers before and after the tokens mean?
> Also, I noticed that sometimes the numbers after the tokens are
> missing. What does that mean?
> Thank you very much,
>      Michael.
> ------------------------------------------------------------------------
> _______________________________________________
> Moses-support mailing list
> Moses-support@mit.edu
> http://mailman.mit.edu/mailman/listinfo/moses-support

     Alexandre Allauzen
 Univ. Paris XI, LIMSI-CNRS
Tel : (80.88)
Bur : 114     LIMSI Bat. 508
