Hello all,

I am training the SMT baseline system using the data provided at
http://www.statmt.org/wmt09/translation-task.html on a Linux server
with 16 GB of RAM.

To train the language model I am using the corpora found at
http://www.statmt.org/wmt09/training-monolingual.tar. More precisely, I
am using the concatenation of the files europarl-v4.en.gz and
news-train08.en.gz. The resulting corpus is around 550 million words.
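
For reference, the corpus preparation can be sketched roughly as follows.
This is only an illustration: the helper names build_corpus and
count_words are hypothetical, and lowercasing with tr (which handles only
ASCII letters in most implementations) is an assumption, not necessarily
the exact tool used.

```shell
#!/bin/sh
# Hypothetical helper: decompress the given .gz files in order, lowercase
# the text, and write the result to the file named by the first argument.
build_corpus() {
    out=$1
    shift
    # gzip -dc writes the decompressed files to stdout one after another;
    # tr maps upper-case letters to lower case (assumption: ASCII only).
    gzip -dc "$@" | tr '[:upper:]' '[:lower:]' > "$out"
}

# Hypothetical helper: count whitespace-separated tokens in a file.
count_words() {
    wc -w < "$1" | tr -d ' '
}

# Example invocation with the file names mentioned in this message:
# build_corpus corpus.lowercased europarl-v4.en.gz news-train08.en.gz
# count_words corpus.lowercased
```

The word count is streamed through wc, so checking the corpus size this
way needs very little memory.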

The command line used to train the language model is:

srilm-1.5.7/bin/x86_64/ngram-count -order 5 -interpolate -kndiscount \
    -text corpus.lowercased -lm corpus.lm

It runs out of memory (all 16 GB!) and starts using swap.

Is this normal? How could I deal with it without using a smaller corpus?

Also, does anyone know why news-train08.en.gz is much larger than the
other news-train08 files?

Thanks in advance for your valuable help.

Regards

-- 
Felipe Sánchez Martínez <[EMAIL PROTECTED]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support