Dear Colleagues,

We are using Moses to revitalize Lemko, an endangered low-resource language. We have 70,000 Lemko words in 3,387 segments, carefully translated into native English and fully sentence-aligned. Our current BLEU score is about 0.10.

As far as hardware goes, we're using the cloud: an Amazon EC2 p2.xlarge instance (1 GPU, 4 vCPUs, 61 GiB RAM).

Questions:

- How should we divide our precious 3,387 bilingual segments into training, tuning, and testing data? What ratio is ideal?
- Considering that at this point bilingual content is much dearer to us than processing power (Amazon AWS costs us USD 0.90 per hour, while translation costs us USD 0.15 per word), how do we make the most of what we've got?
- Is there anything we could do beyond the default settings that might lead to a large improvement in the BLEU score?
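For reference, here is how we are currently thinking of doing the split, in case it helps frame the first question. This is only a sketch: the ~90/5/5 ratio is our guess, not established practice, and the corpus filenames are placeholders (the script generates toy aligned files so it runs standalone; we would substitute our real Lemko/English files).

```shell
#!/bin/sh
# Sketch: split a 3,387-segment sentence-aligned parallel corpus into
# train/tune/test while keeping both sides aligned line-by-line.
set -e

# Toy stand-ins for the real aligned corpus files (placeholders).
seq 1 3387 | sed 's/^/lemko segment /'   > corpus.lemko
seq 1 3387 | sed 's/^/english segment /' > corpus.en

# Shuffle both sides together (tab-joined) so alignment is preserved;
# a fixed --random-source makes the shuffle reproducible.
paste corpus.lemko corpus.en | shuf --random-source=corpus.lemko > corpus.both
cut -f1 corpus.both > shuf.lemko
cut -f2 corpus.both > shuf.en

# Assumed ~90/5/5 split: 3047 train, 170 tune, 170 test segments.
for lang in lemko en; do
  head -n 3047 shuf.$lang                 > train.$lang
  head -n 3217 shuf.$lang | tail -n 170   > tune.$lang
  tail -n 170  shuf.$lang                 > test.$lang
done

wc -l train.lemko tune.lemko test.lemko
```

Is a split along these lines sensible with so little data, or should we hold out fewer segments (or cross-validate) given how expensive each translated segment is?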
Current training command:

~/workspace/mosesdecoder/scripts/training/train-model.perl \
  --parallel --mgiza-cpus 4 \
  -root-dir train \
  --corpus ~/corpus/train.ru-en.clean \
  --f ru --e en \
  --alignment grow-diag-final-and \
  --reordering msd-bidirectional-fe \
  --lm 0:3:/home/ubuntu/lm/train.ru-en.blm.en:8 \
  -external-bin-dir ~/workspace/bin/training-tools/mgizapp

Current tuning command:

~/workspace/mosesdecoder/scripts/training/mert-moses.pl \
  ~/corpus/tune.ru-en.true.ru ~/corpus/tune.ru-en.true.en \
  ~/workspace/mosesdecoder/bin/moses ~/working/train/model/moses.ini \
  --mertdir ~/workspace/mosesdecoder/bin/ \
  --decoder-flags="-threads 4"

Thanks for your help!
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
