Hello all,

I am training the SMT baseline system using the data provided at http://www.statmt.org/wmt09/translation-task.html on a Linux server with 16 GB of RAM.

To train the language model I am using the corpora found at http://www.statmt.org/wmt09/training-monolingual.tar. More precisely, I am using the concatenation of the files europarl-v4.en.gz and news-train08.en.gz. The resulting corpus contains around 550 million words.

The command line used to train the language model is:

    srilm-1.5.7/bin/x86_64/ngram-count -order 5 -interpolate -kndiscount -text corpus.lowercased -lm corpus.lm

It runs out of memory (16 GB!) and starts using swap. Is this normal? How could I deal with it without using a smaller corpus?

Also, does anyone know why news-train08.en.gz is so much larger than the rest of the news-train08 files?

Thanks in advance for your valuable help.

Regards
--
Felipe Sánchez Martínez <[EMAIL PROTECTED]>
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante, E-03071 Alicante (Spain)
Tel.: +34 965 903 400, ext: 2038
Fax: +34 965 909 326
http://www.dlsi.ua.es/~fsanchez
