RE: "...it cost 9G memory when the corpus size is 150M to train a 5-gram language model." This is difficult to estimate. It's highly dependent on the corpus content.
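To illustrate why the content matters: a tool that keeps n-gram counts in memory needs space that grows roughly with the number of distinct n-grams, not with the raw file size. The following is a minimal Python sketch under that assumption. It is not how SRILM or IRSTLM count internally, and the function name `count_distinct_ngrams` and both toy corpora are made up for illustration. Running something like it on a small sample of your corpus gives a rough feel for how quickly the distinct 5-grams grow.

```python
def count_distinct_ngrams(lines, max_order=5):
    """Return {order: number of distinct n-grams} for orders 1..max_order.

    Tokens are split on whitespace; n-grams do not cross line boundaries.
    """
    seen = {n: set() for n in range(1, max_order + 1)}
    for line in lines:
        tokens = line.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                seen[n].add(tuple(tokens[i:i + n]))
    return {n: len(ngrams) for n, ngrams in seen.items()}


# Two hypothetical corpora with the same number of tokens (6000 each).
# One repeats a single sentence; the other never repeats a word.
repetitive = ["the cat sat on the mat"] * 1000
varied = [" ".join(f"w{i}_{j}" for j in range(6)) for i in range(1000)]

repetitive_counts = count_distinct_ngrams(repetitive)
varied_counts = count_distinct_ngrams(varied)

print("repetitive:", repetitive_counts)
print("varied:    ", varied_counts)
```

Both toy corpora have the same token count, but the varied one produces far more distinct 5-grams, which is the quantity an in-memory counter has to store.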
RE IRSTLM: It supports splitting the corpus into multiple segments with the -k option for build-lm.sh. You'll have to experiment to see what size works for you. You might want to try RandLM if your multi-gigabyte corpus is too big for IRSTLM.

Tom

On Wed, 17 Aug 2011 14:51:16 +0800, "Li Xianhua" wrote:

Hello everyone,

Recently we have been working on building a huge language model. The corpus size is about 3G, and our computer has 40G of memory. We failed to build a 5-gram language model with SRILM because of insufficient memory. We divided the corpus into 2 parts and trained a language model on each of them separately; however, this still failed.

I would like to know approximately how large the corpus can be when training a 5-gram language model with SRILM. As my colleague reported, it cost 9G memory when the corpus size is 150M to train a 5-gram language model. Is this normal?

We are now trying to use IRSTLM. Are there any suggestions?

----------------------------------------------------
Best wishes!
Xianhua Li
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
