Hi Raj, Tom and Marcin,

I binarized the ARPA file last night, following your suggestions. The result is a binarized LM file of roughly *100GB*. (@Marcin: that is well above the 20-30GB you estimated. Is that size okay?) Fortunately, the infrastructure at my university lets me run experiments with a model of that size.

Thanks a lot for your help. It is great to be able to play with such huge LMs :))

Best,
On Mon, Nov 24, 2014 at 3:19 PM, Marcin Junczys-Dowmunt <[email protected]> wrote:

> The command
>
> moses/bin/build_binary trie -a 22 -b 8 -q 8 lm.arpa lm.kenlm
>
> will build a compressed binarized model with quantization. You can run
>
> moses/bin/build_binary lm.arpa
>
> without any parameters to get size estimates for different parameter
> settings. I would guess you will get a binarized LM of roughly 20 to 30 GB,
> which is manageable (provided the size you gave us is that of an
> uncompressed text file). You can also use lmplz to build pruned models in
> the first place; these will be much smaller.
>
> On 2014-11-24 15:11, Tom Hoar wrote:
>
> After binarizing such a large ARPA file with KenLM, you'll need to
> configure your moses.ini file to "lazily load the model using mmap." This
> involves using lmodel-file code "9" vs code "8." More details here:
> https://kheafield.com/code/kenlm/moses/
>
> Performance improves significantly if you store the binarized file on an
> SSD.
>
> On 11/24/2014 07:00 PM, Raj Dabre wrote:
>
> Hey Hoang,
> You should binarize the arpa file.
> The readme of the LM tool (KenLM or IRSTLM or SRILM) will tell you how.
> Regards.
>
> On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong <[email protected]> wrote:
>
>> Hi all,
>> I have trained an (unpruned) 5-gram language model on a large corpus of
>> 5 billion words, resulting in an ARPA-format file of roughly 300GB (is
>> that a normal LM size for such big monolingual data?). This is obviously
>> too big for running an SMT system.
>> I have read several papers whose systems use language models trained on
>> similarly sized monolingual corpora. Could you give me some advice on how
>> to handle this, to make it feasible to run SMT systems?
>> I appreciate your help a lot,
>> Best,
>> --
>> Best Regards,
>> Hoang Cuong
>> SMTNerd
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
> --
> Raj Dabre.
> Research Student,
> Graduate School of Informatics,
> Kyoto University.
> CSE MTech, IITB., 2011-2014

--
*Best Regards,
Hoang Cuong
SMTNerd*
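
The steps suggested in the thread above can be sketched as a small shell script. This is only a sketch under stated assumptions, not a tested recipe for a 300GB model: the file names lm.arpa, lm.kenlm and lm-section.ini are placeholders, the tool path moses/bin/build_binary is the one from Marcin's message, and the `9 0 5 <file>` entry assumes the legacy [lmodel-file] layout (type, factor, order, path) with a 5-gram model on factor 0. The script skips binarization when the tool or the ARPA file is not present.

```shell
#!/bin/sh
# Placeholder names; adjust to your own setup.
LM_ARPA="lm.arpa"
LM_BIN="lm.kenlm"
BUILD_BINARY="moses/bin/build_binary"

if [ -x "$BUILD_BINARY" ] && [ -f "$LM_ARPA" ]; then
    # With no output file, build_binary only prints size estimates
    # for the different data structures and settings.
    "$BUILD_BINARY" "$LM_ARPA"

    # Compressed trie with quantized probabilities and backoffs
    # (the settings Marcin suggested).
    "$BUILD_BINARY" trie -a 22 -b 8 -q 8 "$LM_ARPA" "$LM_BIN"
else
    echo "build_binary or $LM_ARPA not found; skipping binarization" >&2
fi

# Legacy moses.ini entry. Code 9 asks for lazy (mmap) loading of a
# KenLM model; code 8 would load the whole model up front.
# Assumed fields: type, factor, order, path.
LM_LINE="9 0 5 $LM_BIN"
printf '[lmodel-file]\n%s\n' "$LM_LINE" > lm-section.ini
```

The generated lm-section.ini is meant to be pasted into moses.ini by hand, replacing any existing [lmodel-file] section. Lazy loading reads pages from disk on demand, which is why Tom recommends keeping the binarized file on an SSD.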
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
