Hi,
@Marcin, the bigger-than-usual reordering table is due to our
allowance for high distortion. 2.4 is after cleaning it up; the
original size contained loads of rubbish sentence pairs.
Where do you have that distortion?
BTW, the compactization finished in <4hrs. I guess by the 3rd hour I
was starting to doubt whether the server could handle that amount.
The binarization is not that heavy on the server. It just takes a while.
As long as there is progress, you are fine.
But the phrase table size didn't go down as much as I expected; it's
still 1.1G, which might take forever to load when decoding. Will the
.minphr file be faster to load (it looks binarized, I think) than the
normal .gz phrase table? If not, we're still looking at >18hrs of
loading time on the server.
Try it :) Should not take more than a couple of seconds.
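For what it's worth, the compact tables are pointed to from the [feature] section of moses.ini roughly like this; the feature names, paths and feature counts below are placeholders to adjust to your own setup:

  PhraseDictionaryCompact name=TranslationModel0 num-features=4 input-factor=0 output-factor=0 path=/path/to/phrase-table.minphr
  LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/path/to/reordering-table.minlexr

The compact formats are binary and can be memory-mapped, so start-up should be seconds rather than hours compared with reading a .gz text table.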
But the reordering table went down from 6.7GB to 420M.
Weird. I am a little bit suspicious of your text tables, as the size
distributions seem so unusual. But if it works for you, then alright.
What exactly is the process for dealing with models >4GB? The standard
Moses tutorial, with its "rites of passage" steps, would fail at every
instance when using a non-binarized LM, a non-compacted
phrase table/lexical table, and non-threaded
processing/training/decoding.
Is there a guide on dealing with big models? How big can a model grow,
and what server clock speed/RAM is needed in proportion?
I have a 128 GB server, and I am building and using models from 150M
parallel sentences and LMs from hundreds of GB of monolingual text; I
am doing just fine. Unbinarized models are not meant for deployment on
any machine, whatever its size. Treat the text models as intermediate
representations and the binarized models as final deployment models.
You are fine in terms of RAM if your binarized models fit into RAM,
plus a couple of GB for computations.
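To put rough numbers on that with the sizes mentioned in this thread: the compact phrase table (~1.1G), the compact reordering table (~420M) and the binarized LM (~75GB) come to well under 80GB, which leaves comfortable headroom on a 128GB machine. A quick sanity check (file names are illustrative):

  # total on-disk size of the binarized models ~ lower bound on RAM needed
  du -ch phrase-table.minphr reordering-table.minlexr language-model.binary

The on-disk size of the binarized files is a reasonable proxy for their resident size, plus a couple of GB for the decoder itself.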
Regards,
Liling
On Tue, Apr 21, 2015 at 6:39 PM, liling tan <alvati...@gmail.com> wrote:
Dear Moses devs/users,
*How should one work with big models?*
Originally, I had 4.5 million parallel sentences and ~13 million
sentences of monolingual data for the source and target languages.
After cleaning with
https://github.com/alvations/mosesdecoder/blob/master/scripts/other/gacha_filter.py
and
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl,
I got 2.6 million parallel sentences.
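(For reference, the clean-corpus-n.perl step was along these lines; the corpus prefix, language codes and length limits here are placeholders rather than the exact invocation:)

  # gacha_filter.py was run first, per its own usage notes, then:
  perl mosesdecoder/scripts/training/clean-corpus-n.perl \
      corpus.gacha src tgt corpus.clean 1 80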
And after training a phrase-based model with reordering (roughly the
command sketched after the sizes below), I get:
9.9GB of phrase-table.gz
3.2GB of reordering-table.gz
~45GB of language-model.arpa.gz
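(The training command had roughly this shape; the paths, language codes, reordering configuration and LM settings are placeholders to adapt, not our exact call:)

  perl mosesdecoder/scripts/training/train-model.perl \
      -root-dir train -corpus corpus.clean -f src -e tgt \
      -alignment grow-diag-final-and \
      -reordering msd-bidirectional-fe \
      -lm 0:5:/path/to/language-model.arpa.gz:8 \
      -external-bin-dir /path/to/word-alignment-tools -cores 10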
I've binarized the language model and got
~75GB of language-model.binary
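(Concretely, the binarization with KenLM's build_binary looks roughly like this; if the KenLM build lacks zlib support, decompress the ARPA first:)

  # trie is the more compact of KenLM's two binary formats
  mosesdecoder/bin/build_binary trie language-model.arpa.gz language-model.binary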
We ran moses-mert.pl and it completed the
tuning in 3-4 days for both directions on the dev set (3000
sentences), after filtering:
364M phrase-table.gz
1.8GB reordering-table.gz
On the test set, we did the filtering too (roughly as sketched after
the sizes below), but when decoding it took 18 hours to load only 50%
of the phrase table:
1.5GB phrase-table.gz
6.7GB reordering-table.gz
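(The filtering in both cases used filter-model-given-input.pl against the respective input set, roughly like this; the directory names and paths are placeholders:)

  perl mosesdecoder/scripts/training/filter-model-given-input.pl \
      filtered-test mert-work/moses.ini test.input.txt

The filter script also accepts a -Binarizer option (pointing it at processPhraseTableMin, for instance), which would produce compact tables directly instead of filtered .gz tables.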
So we decided to compactize the phrase table.
For the phrase table and reordering table, we used
processPhraseTableMin and processLexicalTableMin, and I'm still
waiting for the minimized phrase table. It has been
running for 3 hours with 10 threads each on 2.5GHz cores.
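(The compaction commands were roughly the following; -nscores, the thread count and the paths are placeholders to match the actual tables:)

  # produces phrase-table.minphr
  mosesdecoder/bin/processPhraseTableMin -in phrase-table.gz -out phrase-table \
      -nscores 4 -threads 10
  # produces reordering-table.minlexr
  mosesdecoder/bin/processLexicalTableMin -in reordering-table.gz -out reordering-table \
      -threads 10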
*Anyone have any rough idea how small the phrase table and lexical
table would get?*
*With that kind of model, how much RAM would be necessary? And how
long would it take to load the model into RAM?
Any other tips/hints on working with big models efficiently?*
*Is it even possible for us to use models of such a size on our
small server (24 cores, 2.5GHz, 128GB RAM)? If not, how big should
our server get?*
Regards,
Liling
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support