Dear Colleagues,
We are using Moses to revitalize Lemko, an endangered low-resource
language. We have 70,000 Lemko words in 3,387 segments, carefully
translated into fluent English and sentence-aligned.
Our current BLEU score is about 0.10.
As far as hardware goes, we're using the cloud: Amazon EC2 p2.xlarge
(1 GPU, 4 vCPUs, 61 GiB RAM).
Questions:
- How should we divide our precious 3,387 bilingual segments into
training, tuning, and testing data? What ratio is ideal?
- Considering that at this point bilingual content is far more
expensive for us than processing power (AWS costs us USD 0.90 per
hour, while translation costs us USD 0.15 per word), how do we make
the most of what we have?
- Is there anything we could do other than the default settings that
might lead to a large improvement in the BLEU score?
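In case concrete numbers help the discussion, here is the sort of
80/10/10 split we have been sketching with plain coreutils. The toy
100-line corpus and the ratio are only placeholders for our real
3,387-segment files and whatever ratio the list recommends; pasting
the two sides together first keeps the sentence pairs aligned through
the shuffle.

```shell
set -e

# Toy stand-in corpus (100 aligned lines) so the sketch runs end-to-end;
# replace with the real Lemko (.lm) and English (.en) files.
seq 1 100 | sed 's/^/lemko line /'   > corpus.lm
seq 1 100 | sed 's/^/english line /' > corpus.en

# Keep each sentence pair on one line so shuffling cannot break alignment.
# (Pass --random-source to shuf if a reproducible split is needed.)
paste corpus.lm corpus.en | shuf > shuffled.tsv

total=$(wc -l < shuffled.tsv)
ntrain=$((total * 80 / 100))   # ~80% training
ntune=$((total * 10 / 100))    # ~10% tuning; the remainder is the test set

head -n "$ntrain" shuffled.tsv > train.tsv
tail -n +"$((ntrain + 1))" shuffled.tsv | head -n "$ntune" > tune.tsv
tail -n +"$((ntrain + ntune + 1))" shuffled.tsv > test.tsv

# Split the tab-separated pairs back into parallel files.
for split in train tune test; do
  cut -f1 "$split.tsv" > "$split.lm"
  cut -f2 "$split.tsv" > "$split.en"
done
wc -l train.lm tune.lm test.lm
```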

Current training model:
~/workspace/mosesdecoder/scripts/training/train-model.perl \
 -parallel -mgiza-cpus 4 \
 -root-dir train \
 -corpus ~/corpus/train.ru-en.clean \
 -f ru -e en \
 -alignment grow-diag-final-and \
 -reordering msd-bidirectional-fe \
 -lm 0:3:/home/ubuntu/lm/train.ru-en.blm.en:8 \
 -external-bin-dir ~/workspace/bin/training-tools/mgizapp
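One thing we have been wondering about ourselves: the language model
above is only order 3. Since monolingual English is cheap compared to
translation, we have been considering training a higher-order KenLM
model on extra English text. A sketch of what we mean (mono.en is a
placeholder for whatever monolingual English we gather; this is not
yet what we run):

```shell
# Train a 5-gram KenLM model on extra monolingual English, then binarise it.
~/workspace/mosesdecoder/bin/lmplz -o 5 < mono.en > train.arpa.en
~/workspace/mosesdecoder/bin/build_binary train.arpa.en train.blm.en
```

The training command would then point -lm at the new binary with order
5 instead of 3.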

Current tuning model:
~/workspace/mosesdecoder/scripts/training/mert-moses.pl \
 ~/corpus/tune.ru-en.true.ru ~/corpus/tune.ru-en.true.en \
 ~/workspace/mosesdecoder/bin/moses ~/working/train/model/moses.ini \
 --mertdir ~/workspace/mosesdecoder/bin/ \
 --decoder-flags="-threads 4"
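For the record, this is roughly how we have been scoring the held-out
test set (paths follow the pattern above; in practice one would use
the tuned moses.ini that mert-moses.pl writes into its working
directory rather than the untuned one):

```shell
# Decode the held-out test set and score it against the reference.
~/workspace/mosesdecoder/bin/moses -f ~/working/train/model/moses.ini \
    < ~/corpus/test.ru-en.true.ru > test.out
~/workspace/mosesdecoder/scripts/generic/multi-bleu.perl \
    ~/corpus/test.ru-en.true.en < test.out
```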

Thanks for your help!
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
