Hi, what you observe is not entirely unexpected.
What you could do:
(a) only train a 4-gram model
(b) train the 5-gram in parts and interpolate them
    (check SRILM web pages)

The English data is bigger, since we are crawling more
English-language news sites.

-phi

On Fri, Nov 21, 2008 at 1:24 PM, Felipe Sánchez Martínez <[EMAIL PROTECTED]> wrote:
>
> Hello all,
>
> I am training the SMT baseline system using the data provided at
> http://www.statmt.org/wmt09/translation-task.html on a Linux server
> with 16 GB of RAM.
>
> To train the language model I am using the corpora found at
> http://www.statmt.org/wmt09/training-monolingual.tar More precisely, I
> am using the concatenation of the files europarl-v4.en.gz and
> news-train08.en.gz. The corpus is around 550 million words.
>
> The command line used to train the language model is:
>
> srilm-1.5.7/bin/x86_64/ngram-count -order 5 -interpolate -kndiscount
> -text corpus.lowercased -lm corpus.lm
>
> It runs out of memory (16 GB!!) and starts using swap.
>
> Is this normal? How could I deal with it without using a smaller corpus?
>
> Does anyone know why news-train08.en.gz is much larger than the rest of
> the news-train08 files?
>
> Thanks in advance for your valuable help.
>
> Regards
>
> --
> Felipe Sánchez Martínez <[EMAIL PROTECTED]>
> Departamento de Lenguajes y Sistemas Informáticos
> Universidad de Alicante, E-03071 Alicante (Spain)
> Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
> http://www.dlsi.ua.es/~fsanchez
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
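
For option (b) above, a rough sketch of what "train in parts and interpolate" might look like with SRILM's ngram-count and ngram tools. The file names (europarl.lowercased, news.lowercased, europarl.lm, news.lm) and the 0.5 weight are placeholders I made up, not values from this thread. The helper functions only print the commands unless DRY_RUN=0 is set, so you can check them before running anything.

```shell
#!/bin/sh
# Sketch only: file names and the 0.5 weight are assumptions, not from the thread.

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Train one 5-gram model on a single corpus part, using the same
# options as the original command line.
# $1 = lowercased text file, $2 = output LM file
train_part() {
    run ngram-count -order 5 -interpolate -kndiscount -text "$1" -lm "$2"
}

# Linearly interpolate two models into one.
# $1 = first LM, $2 = second LM, $3 = weight of the first LM, $4 = output LM
mix_two() {
    run ngram -order 5 -lm "$1" -mix-lm "$2" -lambda "$3" -write-lm "$4"
}

# Train each part separately, then merge the two models.
train_and_mix() {
    train_part europarl.lowercased europarl.lm
    train_part news.lowercased news.lm
    mix_two europarl.lm news.lm 0.5 corpus.lm
}
```

To actually run it: DRY_RUN=0 train_and_mix. The news part may still be too large on its own and might need to be split further in the same way. The weight is normally tuned on held-out text rather than fixed at 0.5; SRILM ships a compute-best-mix script for that, see the SRILM pages for details.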
