Hi all,
I use a server with 130 GB of RAM and 24 cores.
I have a question about the training data I could use.

In fact, I want to train an SMT system on a very large bilingual corpus
such as WMT 2010 (or NIST) to see what the highest BLEU score I can
obtain is (though I know that it also depends heavily on the size of the
test set).

However, I usually run into errors during Moses training, and I have to
truncate the data to obtain a smaller training corpus. If I do not
truncate it, I usually get stuck on errors such as:

ERROR: Execution of: /home/cuongh/CODE/giza-pp/GIZA++  -CoocurrenceFile
/home/cuongh/STATMT.BIG/giza.fr-en/fr-en.cooc -c
/home/cuongh/STATMT.BIG/corpus/fr-en-int-train.snt -m1 5 -m2 3 -m3 3 -m4 0
-mh 0 -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 1 -nsmooth 4
-o /home/cuongh/STATMT.BIG/giza.fr-en/fr-en -onlyaldumps 1 -p0 0.999 -s
/home/cuongh/STATMT.BIG/corpus/en.vcb -t
/home/cuongh/STATMT.BIG/corpus/fr.vcb
died with signal 11, with coredump
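
For reference, the truncation I do is nothing more sophisticated than
taking the first N lines from both sides of the line-aligned corpus,
roughly like this (the paths and N here are just placeholders, not my
real setup):

    # keep the first N sentence pairs of a line-aligned parallel corpus
    N=1000000
    head -n $N corpus/train.fr > corpus/train.small.fr
    head -n $N corpus/train.en > corpus/train.small.en

Taking the same number of lines from each side keeps the sentence pairs
aligned, but it obviously throws data away, which is exactly what I
would like to avoid.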

I just wonder: for a server like mine, what is the largest amount of
training data I could train on?
In addition, for training Moses on very large bilingual data, what
would the experts here recommend to me?

I really need this.
I love working on SMT, but frankly, I'm just a Master's student, not a
PhD student. However, I will graduate soon.
Thanks,
Best regards,
C. Hoang
-- 
Hoàng Cường
SMTNerd