Thanks Michael for the paper, and thanks Tom. Based on the paper, one solution is to replicate MERT tuning and testing at least three times. My ideas have subtle effects on BLEU. Do you recommend that I run MERT and testing three times, or more? Should I increase the number of sentences for tuning? My Persian-to-English dataset includes:

Training: about 240,000 sentences
Tune: 1,000 sentences
Test: 1,000 sentences

From: [email protected]
Date: Sun, 11 Oct 2015 12:53:37 +0700
To: [email protected]
Subject: Re: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?

Yes. Each tuning run with the same test set will give you small variations in the final BLEU. Yours look like they're in a normal range.

Date: Sun, 11 Oct 2015 04:23:56 +0000
From: Davood Mohammadifar <[email protected]>
Subject: [Moses-support] BLEU score difference about 0.13 for one dataset is normal?
To: Moses Support <[email protected]>

Hello everyone,

I noticed different BLEU scores for the same dataset, although the difference is small, about 0.13. I trained on my dataset and tuned on the development set for Persian-English translation. After testing, the score was 21.95. The second time, I repeated the same process and obtained 21.82. (My tools were mgiza, mert, ...) Is this difference normal?

My system:
CPU: Core i7-4790K
RAM: 16GB
OS: ubuntu 12.04

Thanks

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
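If you do replicate MERT tuning and testing several times per system, one simple way to report the result is the mean BLEU over the runs together with the run-to-run standard deviation, so a small gain can be read against optimizer noise. The sketch below assumes you have already collected one test-set BLEU score per independent tuning run; the function names `summarize_runs` and `compare_systems` are hypothetical helpers, not part of Moses. It is a descriptive summary only, not a significance test.

```python
import statistics


def summarize_runs(bleu_scores):
    """Return (mean, sample standard deviation) of test-set BLEU scores
    from independent tuning+testing runs of the same system."""
    if len(bleu_scores) < 2:
        raise ValueError("need at least two replicated runs")
    return statistics.mean(bleu_scores), statistics.stdev(bleu_scores)


def compare_systems(baseline_runs, variant_runs):
    """Return (variant mean - baseline mean, baseline stdev, variant stdev),
    so the BLEU difference can be compared with each system's spread."""
    base_mean, base_sd = summarize_runs(baseline_runs)
    var_mean, var_sd = summarize_runs(variant_runs)
    return var_mean - base_mean, base_sd, var_sd


if __name__ == "__main__":
    # The two runs reported in this thread (same data, same pipeline).
    mean, sd = summarize_runs([21.95, 21.82])
    print(f"mean BLEU = {mean:.3f}, stdev = {sd:.3f}")
```

With three or more runs per system, a mean difference that is not clearly larger than the standard deviations suggests the change may be within tuning noise.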
