Hi all,

I am using a server with 130GB of RAM and 24 cores, and I have a question about how much training data I can use.
In fact, I want to train an SMT system on a very large bilingual corpus such as WMT 2010 (or NIST) to see what the biggest BLEU score I can obtain is (though I know that this also depends heavily on the test set). However, I usually run into errors during Moses training and have to truncate the corpus to a smaller size. If I do not truncate it, I usually get stuck on errors such as:

    ERROR: Execution of: /home/cuongh/CODE/giza-pp/GIZA++ -CoocurrenceFile /home/cuongh/STATMT.BIG/giza.fr-en/fr-en.cooc -c /home/cuongh/STATMT.BIG/corpus/fr-en-int-train.snt -m1 5 -m2 3 -m3 3 -m4 0 -mh 0 -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 1 -nsmooth 4 -o /home/cuongh/STATMT.BIG/giza.fr-en/fr-en -onlyaldumps 1 -p0 0.999 -s /home/cuongh/STATMT.BIG/corpus/en.vcb -t /home/cuongh/STATMT.BIG/corpus/fr.vcb
    died with signal 11, with coredump

I just wonder: for a server like mine, what is the largest training set I could train on? In addition, for training Moses on very large bilingual data, what would the experts here recommend? I really need the advice. I love working on SMT, but frankly, I am only a Master's student, not a PhD, though I will graduate soon.

Thanks,
Best regards,
C. Hoang

--
Hoàng Cường
SMTNerd
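P.S. For concreteness, by "truncate" I mean keeping only the first N sentence pairs of the parallel corpus, roughly along these lines (file names are just illustrative; both sides must be cut to the same line count so the pairs stay aligned):

    # Hypothetical file names; keep both sides at exactly the same
    # number of lines so sentence pairs stay aligned.
    head -n 2000000 train.fr-en.fr > train.small.fr
    head -n 2000000 train.fr-en.en > train.small.en
    wc -l train.small.fr train.small.en   # sanity check: counts must match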
