Hello,

Another option is to perform data selection so that you keep only the data relevant to your task. This usually improves performance and, as a nice side effect, your LM is much smaller ;-)

Many people use the algorithm proposed by Moore and Lewis, which is implemented in the freely available tool XenC (on GitHub).
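
For illustration, the idea behind Moore-Lewis selection can be sketched in a few lines of Python. This is a toy, not XenC's implementation. It assumes add-one smoothed unigram models where the real method uses proper n-gram LMs, and it trains the general-domain model on the whole pool rather than on a sample. The names train_unigram, cross_entropy and moore_lewis_select are made up for this example. Each general-domain sentence is scored by its per-token cross-entropy under the in-domain model minus that under the general model, and the lowest-scoring sentences are kept.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total tokens) for whitespace-tokenized sentences."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, counts, total, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def moore_lewis_select(in_domain, general, keep):
    """Keep the `keep` general sentences with the lowest H_in(s) - H_gen(s)."""
    in_counts, in_total = train_unigram(in_domain)
    gen_counts, gen_total = train_unigram(general)
    # Shared vocabulary, plus one slot for tokens unseen by either model.
    vocab_size = len(set(in_counts) | set(gen_counts)) + 1
    scored = [
        (
            cross_entropy(s, in_counts, in_total, vocab_size)
            - cross_entropy(s, gen_counts, gen_total, vocab_size),
            s,
        )
        for s in general
    ]
    scored.sort(key=lambda pair: pair[0])  # most in-domain-like first
    return [s for _, s in scored[:keep]]
```

Calling moore_lewis_select(in_domain, general, keep=n) on two lists of tokenized sentences returns the n sentences from the general pool that look most in-domain. The LM is then trained on that subset only; the cut-off n is something you would tune, for example on held-out in-domain data.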

best,

Holger

On 11/25/2014 12:02 PM, Hoang Cuong wrote:
Hi Raj, Tom and Marcin,
I binarized the ARPA file last night, following your suggestion. In the end, it resulted in a binarized LM file of roughly *100GB* (@Marcin - it is not the 20-30GB you suggested; is this size okay?). Fortunately, the infrastructure at my university allows me to run experiments with that.
Thanks a lot for your help.
It is so great to play with such huge LMs :))
Best,


On Mon, Nov 24, 2014 at 3:19 PM, Marcin Junczys-Dowmunt <[email protected]> wrote:

    The command

    moses/bin/build_binary trie -a 22 -b 8 -q 8 lm.arpa lm.kenlm

    will build a compressed binarized model with quantization. You can run

    moses/bin/build_binary lm.arpa

    without any parameters to get size estimates for different
    parameter settings. I would guess you will get a binarized LM of
    roughly 20 to 30 GB, which is manageable (provided the size you gave
    us is that of an uncompressed text file). You can also use lmplz
    to build pruned models in the first place; these will be much
    smaller.

    On 2014-11-24 15:11, Tom Hoar wrote:

    After binarizing such a large ARPA file with KenLM, you'll need
    to configure your moses.ini file to "lazily load the model using
    mmap." This involves using lmodel-file code "9" vs code "8." More
    details here: https://kheafield.com/code/kenlm/moses/

    Performance improves significantly if you store the binarized
    file on an SSD.
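
As a rough sketch of that configuration change, assuming the older moses.ini layout in which each line under [lmodel-file] reads `<code> <factor> <order> <path>`, the switch from code 8 to code 9 could be scripted as below. The function name enable_lazy_kenlm and the sample path are hypothetical; check the page linked above for the syntax your Moses version expects.

```python
def enable_lazy_kenlm(ini_text):
    """Switch KenLM entries (code 8) to lazy mmap loading (code 9).

    Only lines inside the [lmodel-file] section are touched; this assumes
    the older "<code> <factor> <order> <path>" line layout.
    """
    out = []
    in_lm_section = False
    for line in ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new section header starts; track whether it is the LM section.
            in_lm_section = stripped == "[lmodel-file]"
        elif in_lm_section and stripped and not stripped.startswith("#"):
            fields = stripped.split()
            if fields[0] == "8":
                fields[0] = "9"
                line = " ".join(fields)
        out.append(line)
    return "\n".join(out)


# Hypothetical example entry: KenLM, factor 0, order 5.
sample = "[lmodel-file]\n8 0 5 /path/to/lm.kenlm\n"
print(enable_lazy_kenlm(sample))
```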




    On 11/24/2014 07:00 PM, Raj Dabre wrote:
    Hey Hoang,
    You should binarize the arpa file.
    The readme of the LM tool (KenLM or IRSTLM or SRILM) will tell
    you how.
    Regards.

    On Mon, Nov 24, 2014 at 7:07 PM, Hoang Cuong
    <[email protected]> wrote:

        Hi all,
        I have trained an (unpruned) 5-gram language model on a
        large corpus of 5 billion words, resulting in an ARPA-format
        file of roughly 300GB (is that a normal LM size for such a
        large monolingual corpus?). This is obviously too big for
        running an SMT system.
        I have read several papers whose systems use language models
        trained on similar monolingual corpora. Could you give me
        some advice on how to handle this and make it feasible to run
        SMT systems?
        I appreciate your help a lot,
        Best,
        --
        Best Regards,
        Hoang Cuong
        SMTNerd

        _______________________________________________
        Moses-support mailing list
        [email protected]
        http://mailman.mit.edu/mailman/listinfo/moses-support




    --
    Raj Dabre.
    Research Student,
    Graduate School of Informatics,
    Kyoto University.
    CSE MTech, IITB., 2011-2014






--
Best Regards,
Hoang Cuong
SMTNerd

