(my message bounced as it was too long ... here is a truncated version)

Miles

---------- Forwarded message ----------
From: Miles Osborne <[EMAIL PROTECTED]>
Date: 2008/8/14
Subject: Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model
and Train Model
To: Llio Humphreys <[EMAIL PROTECTED]>
Cc: moses-support <[email protected]>


building language models (using, for example, ngram-count) is computationally
expensive.  from what you tell the list, it seems you don't have enough
physical memory to run it properly.

you have a number of options:

--specify a lower-order model (eg a 4-gram rather than a 5-gram, or even a
3-gram); depending upon how much monolingual training material you have, this
may not produce worse results, and it will certainly run faster and require
less space.
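a quick way to see why the order matters for memory: the number of distinct
ngrams you have to store grows with the order.  a toy Python check on
synthetic data (just an illustration, nothing to do with SRILM itself):

```python
import random

def distinct_ngrams(tokens, order):
    """Number of distinct n-grams of the given order in a token stream."""
    return len({tuple(tokens[i:i + order])
                for i in range(len(tokens) - order + 1)})

random.seed(0)
vocab = [f"w{i}" for i in range(10)]
tokens = [random.choice(vocab) for _ in range(2000)]  # synthetic corpus

# higher orders mean many more distinct ngrams, hence more memory
for order in (3, 4, 5):
    print(order, distinct_ngrams(tokens, order))
```

on real monolingual text the gap between orders is even larger, since real
vocabularies are far bigger than the ten-word one used here.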

--divide your language model training material into chunks and run
ngram-count on each chunk.  this is one strategy for building LMs over all
of the Gigaword corpus (when you don't have access to a 64-bit machine).
here you would create multiple LMs.
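the idea in miniature (a toy Python sketch standing in for running
ngram-count per chunk -- function names and the tiny corpus are mine):
count ngrams per chunk, then merge, and the merged counts match counting
over the whole corpus in one pass.

```python
from collections import Counter
from itertools import islice

def ngram_counts(lines, order=3):
    """Count n-grams of the given order over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        toks = line.split()
        for i in range(len(toks) - order + 1):
            counts[tuple(toks[i:i + order])] += 1
    return counts

def chunked(lines, size):
    """Yield successive chunks of `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

corpus = ["the cat sat on the mat"] * 4 + ["a dog ran in the park"] * 4

# count each chunk separately (only one chunk needs to fit in memory),
# then merge the partial counts
merged = Counter()
for chunk in chunked(corpus, 2):
    merged.update(ngram_counts(chunk))

assert merged == ngram_counts(corpus)
```

with real data you would write each chunk's counts to disk and merge them
afterwards, rather than holding them all in memory at once.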

--use a disk-based method of creating them.  we have done this, and
basically it trades speed for memory.

--take the radical option and simply don't bother smoothing at all (ie use
Google's "stupid backoff").  this makes training LMs trivial -- just compute
the counts of ngrams and work out how to store them.  i reckon it should be
possible to do this and create an ARPA file suitable for loading into
SRILM.
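stupid backoff really is that simple -- a minimal Python sketch (my own
function names; the 0.4 backoff weight is the one Google reported):

```python
from collections import Counter

ALPHA = 0.4  # Google's reported backoff weight

def train_counts(tokens, max_order=3):
    """Store raw counts for all n-grams up to max_order. No smoothing."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts, len(tokens)

def stupid_backoff(counts, total, ngram):
    """Score S(w | context): relative frequency if the n-gram was seen,
    otherwise ALPHA times the score of the shortened n-gram."""
    if len(ngram) == 1:
        return counts[ngram] / total
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return ALPHA * stupid_backoff(counts, total, ngram[1:])

tokens = "the cat sat on the mat the cat ran".split()
counts, total = train_counts(tokens)
print(stupid_backoff(counts, total, ("the", "cat")))  # seen bigram
print(stupid_backoff(counts, total, ("mat", "cat")))  # backs off to unigram
```

note the scores are not normalised probabilities (that is the whole point --
no discounting, no normalisation), so the result is a score function rather
than a proper distribution.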

--buy more machines.

Miles
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
