(my message bounced as it was too long ... here is a truncated version) Miles
---------- Forwarded message ----------
From: Miles Osborne <[EMAIL PROTECTED]>
Date: 2008/8/14
Subject: Re: [Moses-support] Fwd: Moses: Prepare Data, Build Language Model and Train Model
To: Llio Humphreys <[EMAIL PROTECTED]>
Cc: moses-support <[email protected]>

Building language models (using, for example, ngram-count) is computationally expensive. From what you tell the list, it seems that you don't have enough physical memory to run it properly. You have a number of options:

--specify a lower-order model (e.g. 4 rather than 5, or even 3); depending upon how much monolingual training material you have, this may not produce worse results, and it will certainly run faster and require less space.

--divide your language model training material into chunks and run ngram-count on each chunk. This is one strategy for building LMs over all of the Gigaword corpus when you don't have access to a 64-bit machine. Here you would create multiple LMs, one per chunk.

--use a disk-based method of creating them. We have done this; basically it trades speed for memory.

--take the radical option and simply don't bother smoothing at all (i.e. use Google's "stupid backoff"). This makes training LMs trivial -- just compute the counts of n-grams and work out how to store them. I reckon it should be possible to do this and create an ARPA file suitable for loading into SRILM.

--buy more machines.

Miles
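To make the "stupid backoff" option concrete, here is a minimal sketch of the idea: count n-grams from a corpus, score a word by its relative frequency given the context, and when the full n-gram is unseen, back off to a shorter context with a fixed discount (0.4 in Google's description). All names and the toy corpus below are illustrative, not from the thread, and this is a sketch of the scoring rule only, not the count-storage or ARPA-export part.

```python
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count all n-grams up to max_order from a token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(counts, context, word, total, alpha=0.4):
    """Score word given context: relative frequency if the full n-gram
    was seen, otherwise recurse on a shorter context, discounted by alpha.
    Scores are not normalized probabilities -- that is the whole trick."""
    ngram = tuple(context) + (word,)
    if counts[ngram] > 0:
        denom = counts[tuple(context)] if context else total
        return counts[ngram] / denom
    if context:
        return alpha * stupid_backoff(counts, context[1:], word, total, alpha)
    return alpha / total  # crude floor for a completely unseen word

tokens = "the cat sat on the mat the cat ran".split()
counts = ngram_counts(tokens, 3)
total = len(tokens)
print(stupid_backoff(counts, ("the",), "cat", total))  # seen bigram: 2/3
print(stupid_backoff(counts, ("mat",), "ran", total))  # backs off: 0.4 * 1/9
```

Because there is no smoothing to estimate, training is a single counting pass, which is why it scales to corpora where Kneser-Ney discounting would exhaust memory.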
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
