Hi,
Binarizing like this gives you a much smaller file:

build_binary trie -a 22 -b 8 -q 8 lm.arpa.gz lm.kenlm

This uses quantization, which in theory could cause quality loss, but I have never seen that happen. Remove "-b 8 -q 8" if you are worried about that; the file will be larger, but still a lot smaller than what you have now. That's about all I do. You said "100 MERT iterations" ... what do you mean by that? Also, the LM uses memory mapping in shared memory, so running several Moses instances in parallel does not use additional memory for the LM; the same goes for the phrase table.
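For reference, the binarized file is then pointed to directly from moses.ini; a sketch of the feature line, assuming a recent Moses with the KENLM feature (path, name and order are placeholders for your setup):

```ini
# [feature] section of moses.ini (sketch; path and order are placeholders)
KENLM name=LM0 factor=0 path=/path/to/lm.kenlm order=5
```

Because the file is memory-mapped, the OS shares its pages across every Moses process that opens it.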


On 25.04.2015 at 21:05, liling tan wrote:
Dear Moses devs/users,

I've automated binarization for the phrase model and reordering model and added multi-threading to the filter script in Moses (https://github.com/moses-smt/mosesdecoder/pull/109). Loading the binarized and filtered translation models works fine.

The issue now is huge language models.

I have a 38GB compressed ARPA language model built from 16GB of raw text. I binarized it with "moses/bin/build_binary" and it grew to 71GB. It works fine if I don't tune my system, but when running MERT tuning for 100 iterations against the 71GB model, tuning takes almost forever.

I did a google search and found KenLM's filter: https://kheafield.com/code/kenlm/filter/

But I'm clueless as to how to make it work.
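From the docs page, the usage seems to be something like this, with the text to filter against on stdin (I may have the argument order wrong; running filter with no arguments prints the usage for your build):

```shell
# Keep only n-grams consistent with the tuning text (sketch; paths are
# placeholders and the argument order is my guess from the docs).
moses/bin/filter union lm.arpa.gz filtered.arpa < dev.input

# Then binarize the much smaller ARPA as usual:
moses/bin/build_binary trie filtered.arpa filtered.kenlm
```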

*What should I do to the LM after binarization?*

*Are there any other steps for manipulating large language models to reduce the computing load when tuning?*

*What is the usual way to tune with a large LM file?*

@Marcin, how did you deal with the large LM file when tuning?


Regards,
Liling

On Tue, Apr 21, 2015 at 7:48 PM, liling tan <[email protected]> wrote:

    Dear Moses dev/users,


    @Marcin, the bigger-than-usual reordering table is due to our
    allowance for high distortion. 2.4 is after cleaning it up; the
    original contained loads of rubbish sentence pairs.

    BTW, the compaction finished in under 4 hours. By the third hour
    I was starting to doubt whether the server could handle that amount.

    But the phrase table didn't shrink as much as I expected; it's
    still 1.1GB, which might take forever to load when decoding. Will
    the .minphr file (it looks binarized, I think) be faster to load
    than the normal .gz phrase table? If not, we're still looking at
    >18 hours of loading time on the server.

    But the reordering table went down from 6.7GB to 420MB.

    What exactly is the process for dealing with models >4GB? The
    standard Moses tutorial and its "rights of passage" processes
    would fail at every step with a non-binarized LM, non-compact
    phrase/lexical tables, and non-threaded
    processing/training/decoding.

    Is there a guide on dealing with big models? How big can a model
    grow, and how much server clock speed/RAM is needed in proportion?


    Regards,
    Liling


    On Tue, Apr 21, 2015 at 6:39 PM, liling tan <[email protected]> wrote:

        Dear Moses devs/users,

        *How should one work with big models?*

        Originally, I had 4.5 million parallel sentences and ~13
        million sentences of monolingual data for the source and
        target languages.

        After cleaning with
        https://github.com/alvations/mosesdecoder/blob/master/scripts/other/gacha_filter.py
        and
        https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl,
        I got 2.6 million parallel sentences.


        And after training a phrase-based model with reordering, I get:

            9.9GB of phrase-table.gz
            3.2GB of reordering-table.gz
            ~45GB of language-model.arpa.gz


        After binarizing the language model, I got:

            ~75GB of language-model.binary

        We ran moses-mert.pl and it completed the tuning in 3-4 days
        in both directions on the dev set (3000 sentences), after
        filtering:


            364M phrase-table.gz
            1.8GB reordering-table.gz
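        (The filtering was done with the standard
        filter-model-given-input.pl script, roughly like this; the
        paths are placeholders for our setup:)

```shell
# Filter the phrase/reordering tables down to the dev set before tuning
# (sketch; paths are placeholders).
scripts/training/filter-model-given-input.pl filtered-dev moses.ini dev.input
```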


        On the test set we did the filtering too, but when decoding it
        took 18 hours to load only 50% of the phrase table:

            1.5GB phrase-table.gz
            6.7GB reordering-table.gz


        So we decided to compact the phrase table.

        For the phrase table and reordering table, we used
        processPhraseTableMin and processLexicalTableMin, and I'm still
        waiting for the minimized phrase table. It has been running for
        3 hours with 10 threads on 2.5GHz cores.
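        (The exact commands were roughly as below; paths are
        placeholders, and -nscores must match the number of scores in
        the phrase table, four here as an assumption:)

```shell
# Build the compact (.minphr / .minlexr) tables (sketch; paths are placeholders).
moses/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table -nscores 4 -threads 10
moses/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table -threads 10
```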

        *Does anyone have a rough idea how small the phrase table and
        lexical table will get?*

        *With that kind of model, how much RAM would be necessary? And
        how long would it take to load the models into RAM?*

        *Any other tips/hints on working with big models efficiently?*

        *Is it even possible for us to use models of this size on our
        small server (24 cores, 2.5GHz, 128GB RAM)? If not, how big
        should our server get?*

        Regards,
        Liling





_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
