I've never done it. There's a limit on the maximum number of shards when parallelizing the extract and scoring steps during training.
I've upped the limit (from 99,999) to 9,999,999:
https://github.com/moses-smt/mosesdecoder/commit/f95a1bb75b2add5b7dcd1e3e5c76777f2f141e21

Other than that, I can't think of any other issues. To minimize disk space usage (and probably increase speed too), compress the intermediate training files. Also, optimize the sorting. These are my arguments to train-model.perl to do these things (a fuller example invocation is sketched after the quoted message below):

  ..../train-model.perl -sort-buffer-size 1G -sort-batch-size 253 -sort-compress gzip -cores 8

On 20 June 2014 09:50, Tom Hoar <[email protected]> wrote:

> Does anyone have experience (words-of-wisdom) training the translation
> model from a parallel corpus with 2.25 trillion phrase pairs and over
> 45 trillion tokens?
>
> Thanks,
> Tom
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
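For context, here is a sketch of how those sort options might slot into a complete train-model.perl invocation, following the standard Moses baseline recipe. The corpus path, language pair (fr-en), alignment heuristic, reordering configuration, LM file and tool directories below are placeholders I have assumed; only -cores and the -sort-* options come from the message above.

  # sketch only: substitute your own corpus, language pair, LM and paths
  nohup nice /path/to/mosesdecoder/scripts/training/train-model.perl \
    -root-dir train -corpus corpus/train.clean -f fr -e en \
    -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:/path/to/lm.blm:8 \
    -external-bin-dir /path/to/giza-bin \
    -cores 8 \
    -sort-buffer-size 1G -sort-batch-size 253 -sort-compress gzip \
    >& training.out &

As I understand it, the -sort-* settings are handed through to the GNU sort calls run over the intermediate extract/score files, giving sort a larger in-memory buffer, a larger merge batch size, and gzip-compressed temporary files.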
