Wow, that was a long read. Still reading, though :) but I see that tuning is essential. I am fairly new to Moses, so could you please check whether the commands I ran were correct (minus the tuning part)? I just adapted the commands from the Moses website for building a baseline system. Below are the commands I ran. My training files are "compilation.en" and "compilation.sm"; my test files are "test.en" and "test.sm".
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/training/compilation.en > ~/corpus/compilation.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < ~/corpus/training/compilation.sm > ~/corpus/compilation.tok.sm

~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/compilation.tok.en
~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.sm --corpus ~/corpus/compilation.tok.sm

~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/compilation.tok.en > ~/corpus/compilation.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.sm < ~/corpus/compilation.tok.sm > ~/corpus/compilation.true.sm

~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/compilation.true sm en ~/corpus/compilation.clean 1 80

cd ~/working
nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
  -corpus ~/corpus/compilation.clean -f sm -e en \
  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
  -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 \
  -external-bin-dir ~/mosesdecoder/tools >& training.out &

cd ~/corpus
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.en > test.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < test.sm > test.tok.sm
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < test.tok.en > test.true.en
~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.sm < test.tok.sm > test.true.sm

cd ~/working
~/mosesdecoder/scripts/training/filter-model-given-input.pl filtered-test train/model/moses.ini ~/corpus/test.true.sm -Binarizer ~/mosesdecoder/bin/processPhraseTableMin

nohup nice ~/mosesdecoder/bin/moses -f ~/working/filtered-test/moses.ini < ~/corpus/test.true.sm > ~/working/test.translated.en 2> ~/working/test.out

~/mosesdecoder/scripts/generic/multi-bleu.perl -lc ~/corpus/test.true.en <
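Since the missing piece is the tuning step, here is a sketch of how it would slot in, following the Moses baseline-system tutorial's mert-moses.pl invocation. This assumes a hypothetical held-out development set (dev.true.sm / dev.true.en, tokenized and truecased the same way as the test data) that is distinct from both the training and test files; it is not runnable without a local Moses installation.

```shell
# Hypothetical tuning step (sketch, per the Moses baseline tutorial):
# tunes the feature weights in train/model/moses.ini against a held-out
# dev set before any BLEU scores are reported.
cd ~/working
nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
  ~/corpus/dev.true.sm ~/corpus/dev.true.en \
  ~/mosesdecoder/bin/moses train/model/moses.ini \
  --mertdir ~/mosesdecoder/bin \
  --decoder-flags="-threads 4" &> mert.out &
```

The tuned weights land in mert-work/moses.ini, which would then replace train/model/moses.ini in the filtering and decoding steps above.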
~/working/test.translated.en

On 22 June 2015 at 01:20, Marcin Junczys-Dowmunt <[email protected]> wrote:

> Hm. That's interesting. The language should not matter.
>
> 1) Do not report results without tuning. They are meaningless. There is a
> whole thread on that; look for "Major bug found in Moses". If you ignore
> the trollish aspects, it contains many good descriptions of why this is a
> mistake.
>
> 2) Assuming it was the same data every time (was it?), without tuning,
> however, I do not quite see where the variance is coming from. This rather
> suggests you have something weird in your pipeline. MGIZA is the only
> stochastic element there, but usually its results are quite consistent.
> For the same weights in your ini file you should have very similar
> results. Tuning would be the part that introduces instability, but even
> then these differences would be a little on the extreme end, though
> possible.
>
> On 22.06.2015 08:12, Hokage Sama wrote:
>
>> Thanks Marcin. It's for a new resource-poor language, so I only trained
>> it with what I could collect so far (i.e. only 190,630 words of parallel
>> data). I retrained the entire system each time without any tuning.
>>
>> On 22 June 2015 at 01:00, Marcin Junczys-Dowmunt <[email protected]> wrote:
>>
>> Hi,
>> I think the average is OK; your variance, however, is quite high. Did
>> you retrain the entire system or just optimize parameters a couple of
>> times?
>>
>> Two useful papers on the topic:
>>
>> https://www.cs.cmu.edu/~jhclark/pubs/significance.pdf
>> http://www.mt-archive.info/MTS-2011-Cettolo.pdf
>>
>> On 22.06.2015 02:37, Hokage Sama wrote:
>> > Hi,
>> >
>> > Since MT training is non-convex and thus the BLEU score varies, which
>> > score should I use for my system? I trained my system three times
>> > using the same data and obtained the three different scores below.
>> > Should I take the average or the best score?
>> >
>> > BLEU = 17.84, 49.1/22.0/12.5/7.5 (BP=1.000, ratio=1.095,
>> > hyp_len=3952, ref_len=3609)
>> > BLEU = 16.51, 48.4/20.7/11.4/6.5 (BP=1.000, ratio=1.093,
>> > hyp_len=3945, ref_len=3609)
>> > BLEU = 15.33, 48.2/20.1/10.3/5.5 (BP=1.000, ratio=1.087,
>> > hyp_len=3924, ref_len=3609)
>> >
>> > Thanks,
>> > Hilton
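Marcin's point that the variance is quite high can be made concrete. A quick sketch (the three BLEU values are copied from the runs quoted above; any POSIX awk will do) computing the mean and sample standard deviation:

```shell
# Mean and sample standard deviation of the three reported BLEU scores.
printf '17.84\n16.51\n15.33\n' | awk '
  { sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt((sumsq - sum * sum / n) / (n - 1))   # sample (n-1) std dev
    printf "mean=%.2f sd=%.2f\n", mean, sd
  }'
# prints: mean=16.56 sd=1.26
```

A spread of roughly 1.26 BLEU between retrainings on identical data is large, which is consistent with Marcin's suggestion that something beyond MGIZA's usual randomness may be varying in the pipeline.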
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
