Hi,

What you observe is not entirely unexpected.

What you could do:
(a) only train a 4-gram model
(b) train the 5-gram in parts and interpolate them (check SRILM web pages)
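
A rough sketch of (b), in case it helps. It assumes the SRILM binaries are on
your PATH and that the two sub-corpora are kept as separate lowercased text
files. The file names, the two-way split and the 0.5 weight are placeholders
of mine, not recommendations, and the helper functions are made up for this
example. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: train one 5-gram model per sub-corpus, then interpolate them.
# Assumptions: SRILM tools on PATH; hypothetical file names; equal weights.
DRY_RUN=${DRY_RUN:-1}   # 1 = just print the commands, 0 = execute them
ORDER=5

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: train_in_parts PART_A PART_B OUTPUT_LM
train_in_parts() {
    part_a=$1
    part_b=$2
    out=$3

    # One model per part, with the same options as the original command.
    for part in "$part_a" "$part_b"; do
        run ngram-count -order "$ORDER" -interpolate -kndiscount \
            -text "$part" -lm "$part.lm"
    done

    # Static interpolation; -lambda is the weight of the model given to -lm.
    run ngram -order "$ORDER" -lm "$part_a.lm" -mix-lm "$part_b.lm" \
        -lambda 0.5 -write-lm "$out"
}

train_in_parts europarl-v4.en.lowercased news-train08.en.lowercased corpus.mixed.lm
```

Run it with DRY_RUN=0 to actually train. The weight is better tuned on
held-out text than left at 0.5. For (a), your original command with
-order 4 instead of -order 5 is all that is needed.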

The English data is bigger because we crawl more English-language
news sites.

-phi

On Fri, Nov 21, 2008 at 1:24 PM, Felipe Sánchez Martínez <
[EMAIL PROTECTED]> wrote:

>
> Hello all,
>
> I am training the SMT baseline system using the data provided at
> http://www.statmt.org/wmt09/translation-task.html on a Linux server
> with 16 GB of RAM.
>
> To train the language model I am using the corpora found at
> http://www.statmt.org/wmt09/training-monolingual.tar. More precisely, I
> am using the concatenation of the files europarl-v4.en.gz and
> news-train08.en.gz. The corpus is around 550 million words.
>
> The command line used to train the language model is:
>
> srilm-1.5.7/bin/x86_64/ngram-count -order 5 -interpolate -kndiscount
> -text corpus.lowercased -lm corpus.lm
>
> It runs out of memory (16 GB!!) and starts using swap.
>
> Is this normal? How could I deal with it without using a smaller corpus?
>
> Does anyone know why news-train08.en.gz is much larger than the other
> news-train08 files?
>
> Thanks in advance for your valuable help.
>
> Regards
>
> --
> Felipe Sánchez Martínez <[EMAIL PROTECTED]>
> Departamento de Lenguajes y Sistemas Informáticos
> Universidad de Alicante, E-03071 Alicante (Spain)
> Tel.: +34 965 903 400, ext: 2038 Fax: +34 965 909 326
> http://www.dlsi.ua.es/~fsanchez
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
