Hi,
Binarizing like this gives you a much smaller file:
build_binary trie -a 22 -b 8 -q 8 lm.arpa.gz lm.kenlm
This uses quantization; in theory that could cause quality loss, but I
have never seen it happen. If you are worried about that, remove
"-b 8 -q 8" (see the variant below); the file will be larger, but
still much smaller than what you have.
That's about all I do. You said "100 MERT iterations" ... what do you
mean by that? Also, the LM uses memory mapping in shared memory, so
running several moses instances in parallel does not use additional
memory for the LM; the same goes for the phrase table.
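As for the KenLM filter you mention below: it reads text (or a
vocabulary) on stdin and writes a filtered ARPA file, which you can
then binarize as above. From memory, and assuming dev.txt is your
tokenized tuning set, the call should look roughly like this
(double-check the mode names against the page you linked, and you may
need to decompress the ARPA first):
filter union lm.arpa filtered.arpa < dev.txt
build_binary trie -a 22 -b 8 -q 8 filtered.arpa filtered.kenlm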
On 25.04.2015 at 21:05, liling tan wrote:
Dear Moses devs/users,
I've automated the binarization for the phrase model and reordering
model and added multi-threading to the filter script in moses
(https://github.com/moses-smt/mosesdecoder/pull/109). Loading the
binarized and filtered translation models works fine.
The issue now is huge language models.
I have a 38GB compressed ARPA language model built from 16GB of raw
text. I binarized it with "moses/bin/build_binary" and it grew to
71GB. It works fine if I don't tune my system, but MERT tuning with
100 iterations on the 71GB model takes almost forever.
I did a Google search and found KenLM's filter:
https://kheafield.com/code/kenlm/filter/
But I'm clueless as to how to make it work.
What should I do to the LM after binarization?
Are there any other steps to manipulate large language models to
reduce the computing load when tuning?
What is the usual way to tune with a large LM file?
@Marcin, how did you deal with the large LM file when tuning?
Regards,
Liling
On Tue, Apr 21, 2015 at 7:48 PM, liling tan <[email protected]> wrote:
Dear Moses devs/users,
@Marcin, the bigger-than-usual reordering table is due to our
allowance for high distortion. 2.4 is after cleaning it up; the
original corpus contained loads of rubbish sentence pairs.
BTW, the compaction finished in under 4 hours. Around the third hour I
was starting to doubt whether the server could handle that amount.
But the phrase table didn't shrink as much as I expected; it's still
1.1G, which might take forever to load when decoding. Will the .minphr
file be faster to load (it looks binarized, I think) than the normal
.gz phrase table? If not, we're still looking at >18 hours of loading
time on the server.
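For what it's worth, I assume we would point the decoder at the
compact table with something like the following feature line in
moses.ini (untested; the path and feature count here are placeholders,
not verified settings):
PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=model/phrase-table.minphr input-factor=0 output-factor=0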
But the reordering table went down from 6.7GB to 420M.
What exactly is the process for dealing with models >4GB? The
standard moses tutorial, the "moses rites of passage", and the usual
processes would fail at every instance when using a non-binarized LM,
non-compacted phrase/lexical tables, and non-threaded
processing/training/decoding.
Is there a guide on dealing with big models? How big can a model grow,
and how much server clock speed/RAM is needed in proportion?
Regards,
Liling
On Tue, Apr 21, 2015 at 6:39 PM, liling tan <[email protected]> wrote:
Dear Moses devs/users,
How should one work with big models?
Originally, I had 4.5 million parallel sentences and ~13 million
monolingual sentences for the source and target languages.
After cleaning with
https://github.com/alvations/mosesdecoder/blob/master/scripts/other/gacha_filter.py
and
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl,
I got 2.6 million parallel sentences.
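For reference, the clean-corpus-n.perl step was along these lines (the
language codes and sentence-length limits here are placeholders, not
our exact settings):
perl scripts/training/clean-corpus-n.perl corpus.tok src trg corpus.clean 1 80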
After training a phrase-based model with reordering, I get:
9.9GB phrase-table.gz
3.2GB reordering-table.gz
~45GB language-model.arpa.gz
After binarizing the language model, I got:
~75GB language-model.binary
We ran mert-moses.pl and it completed the tuning in 3-4 days for both
directions on the dev set (3000 sentences), after filtering:
364M phrase-table.gz
1.8GB reordering-table.gz
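The tuning call was roughly like this (paths abbreviated, and dev.src
and dev.ref stand in for our actual tuning files):
perl scripts/training/mert-moses.pl dev.src dev.ref bin/moses model/moses.ini --mertdir=bin/ --decoder-flags="-threads 10"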
On the test set we did the filtering too, but when decoding it took 18
hours to load only 50% of the phrase table:
1.5GB phrase-table.gz
6.7GB reordering-table.gz
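The filtering step itself was along these lines (assuming test.src is
the tokenized test set; the output directory name is a placeholder):
perl scripts/training/filter-model-given-input.pl filtered-test model/moses.ini test.src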
So we decided to compact the phrase table. For the phrase table and
reordering table we used processPhraseTableMin and
processLexicalTableMin, and I'm still waiting for the minimized phrase
table. They have been running for 3 hours with 10 threads each on
2.5GHz cores.
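The invocations were roughly as follows (input/output names are
placeholders for ours):
processPhraseTableMin -in model/phrase-table.gz -out model/phrase-table -nscores 4 -threads 10
processLexicalTableMin -in model/reordering-table.gz -out model/reordering-table -threads 10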
Does anyone have a rough idea how small the phrase table and lexical
table will get?
With that kind of model, how much RAM would be necessary? And how long
would it take to load the models into RAM? Any other tips/hints on
working with big models efficiently?
Is it even possible for us to use models of such a size on our small
server (24 cores, 2.5GHz, 128GB RAM)? If not, how big should our
server get?
Regards,
Liling
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support