I don't see any reason for non-determinism here, unless mgiza is less stable 
on small data than I thought. Was the LM lm/news-commentary-v8.fr-en.blm.en 
built earlier, somewhere else?
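
If it still needs to be built from your own data, the usual route is KenLM, 
which ships with Moses. A rough sketch (the ~/lm/ output paths below are just 
placeholders):

~/mosesdecoder/bin/lmplz -o 3 < ~/corpus/compilation.true.en > ~/lm/compilation.arpa.en
~/mosesdecoder/bin/build_binary ~/lm/compilation.arpa.en ~/lm/compilation.blm.en

and then point train-model.perl at it with -lm 0:3:$HOME/lm/compilation.blm.en:8.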

And to be sure: for all three runs you used exactly the same data, 
training and test set?

On 22.06.2015 09:34, Hokage Sama wrote:
> Wow, that was a long read. Still reading though :) but I see that 
> tuning is essential. I am fairly new to Moses, so could you please 
> check whether the commands I ran were correct (minus the tuning part)? 
> I just modified the commands on the Moses website for building a 
> baseline system. Below are the commands I ran. My training files are 
> "compilation.en" and "compilation.sm". My test files are "test.en" 
> and "test.sm".
>
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < ~/corpus/training/compilation.en > ~/corpus/compilation.tok.en
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < ~/corpus/training/compilation.sm > ~/corpus/compilation.tok.sm
> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus ~/corpus/compilation.tok.en
> ~/mosesdecoder/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.sm --corpus ~/corpus/compilation.tok.sm
> ~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.en < ~/corpus/compilation.tok.en > ~/corpus/compilation.true.en
> ~/mosesdecoder/scripts/recaser/truecase.perl --model ~/corpus/truecase-model.sm < ~/corpus/compilation.tok.sm > ~/corpus/compilation.true.sm
> ~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/compilation.true sm en ~/corpus/compilation.clean 1 80
>
> cd ~/working
> nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir 
> train -corpus ~/corpus/compilation.clean -f sm -e en -alignment 
> grow-diag-final-and -reordering msd-bidirectional-fe -lm 
> 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 -external-bin-dir 
> ~/mosesdecoder/tools >& training.out &
>
> cd ~/corpus
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < test.en > 
> test.tok.en
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l sm < test.sm > test.tok.sm
> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < test.tok.en > test.true.en
> ~/mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.sm < test.tok.sm > test.true.sm
>
> cd ~/working
> ~/mosesdecoder/scripts/training/filter-model-given-input.pl filtered-test train/model/moses.ini ~/corpus/test.true.sm -Binarizer ~/mosesdecoder/bin/processPhraseTableMin
> nohup nice ~/mosesdecoder/bin/moses -f ~/working/filtered-test/moses.ini < ~/corpus/test.true.sm > ~/working/test.translated.en 2> ~/working/test.out
> ~/mosesdecoder/scripts/generic/multi-bleu.perl -lc 
> ~/corpus/test.true.en < ~/working/test.translated.en
>
> On 22 June 2015 at 01:20, Marcin Junczys-Dowmunt <[email protected]> wrote:
>
>     Hm. That's interesting. The language should not matter.
>
>     1) Do not report results without tuning. They are meaningless.
>     There is a whole thread on that; look for "Major bug found in
>     Moses". If you ignore the trollish aspects, it contains many good
>     descriptions of why this is a mistake.
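>
>     A typical tuning run with MERT, assuming a held-out development set
>     (dev.true.sm and dev.true.en below are just placeholder names for
>     files prepared the same way as your test set), looks roughly like:
>
>     cd ~/working
>     nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl ~/corpus/dev.true.sm ~/corpus/dev.true.en ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ >& mert.out &
>
>     The tuned weights end up in mert-work/moses.ini, and that is the
>     ini file you would then filter and decode with instead of
>     train/model/moses.ini.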
>
>     2) Assuming it was the same data every time (was it?), I do not
>     quite see where the variance is coming from without tuning. This
>     rather suggests you have something weird in your pipeline. Mgiza is
>     the only stochastic element there, but usually its results are
>     quite consistent. For the same weights in your ini file you should
>     get very similar results. Tuning would be the part that introduces
>     instability, but even then these differences would be a little on
>     the extreme end, though possible.
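>
>     One quick way to check whether mgiza is really the culprit
>     (assuming you keep two training runs side by side and use the
>     default train-model.perl output layout) is to compare the
>     alignments and lexical tables directly:
>
>     md5sum run1/train/model/aligned.grow-diag-final-and run2/train/model/aligned.grow-diag-final-and
>     diff run1/train/model/lex.f2e run2/train/model/lex.f2e | head
>
>     Identical checksums would point away from mgiza and at something
>     else in the pipeline; differences there mean the alignment step
>     itself is not reproducible on your data.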
>
>     On 22.06.2015 08:12, Hokage Sama wrote:
>
>         Thanks, Marcin. It's for a new resource-poor language, so I
>         only trained it with what I could collect so far (i.e. only
>         190,630 words of parallel data). I retrained the entire system
>         each time without any tuning.
>
>         On 22 June 2015 at 01:00, Marcin Junczys-Dowmunt
>         <[email protected]> wrote:
>
>             Hi,
>             I think the average is OK; your variance is, however,
>             quite high. Did you retrain the entire system or just
>             optimize parameters a couple of times?
>
>             Two useful papers on the topic:
>
>             https://www.cs.cmu.edu/~jhclark/pubs/significance.pdf
>             http://www.mt-archive.info/MTS-2011-Cettolo.pdf
>
>
>             On 22.06.2015 02:37, Hokage Sama wrote:
>             > Hi,
>             >
>             > Since MT training is non-convex and thus the BLEU score
>             > varies, which score should I use for my system? I trained
>             > my system three times using the same data and obtained the
>             > three different scores below. Should I take the average or
>             > the best score?
>             >
>             > BLEU = 17.84, 49.1/22.0/12.5/7.5 (BP=1.000, ratio=1.095, hyp_len=3952, ref_len=3609)
>             > BLEU = 16.51, 48.4/20.7/11.4/6.5 (BP=1.000, ratio=1.093, hyp_len=3945, ref_len=3609)
>             > BLEU = 15.33, 48.2/20.1/10.3/5.5 (BP=1.000, ratio=1.087, hyp_len=3924, ref_len=3609)
>             >
>             > Thanks,
>             > Hilton
>             >
>             >

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
