Hi,

Inspired by the paper "Does more data always yield better translations?" (aclweb.org/anthology-new/E/E12/E12-1016.pdf), which Ken Fasano kindly linked to, I have been experimenting a great deal.
I have tested several ways of picking a good sample of sentences from the Europarl corpus, using 10% of the sentences. I thought I had found a promising method, so I picked a larger sample, 35%, and expected a much better translation. On the contrary, the translation of my test text was terrible: it was turned into garbage, completely useless.

I trained the phrase model with:

nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ~/corpora/Total1.sv-fr.clean_urval -f sv -e fr -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/Total1.blm.fr:8 -external-bin-dir ~/mosesdecoder/tools -parallel -cores 4 -score-options --GoodTuring >& training.out &

The training was incredibly fast, in spite of the larger training corpus. After the line stating that moses.ini was created, I found lots of warnings of the type:

WARNING: sentence 2448049 has alignment point (15, 19) out of bounds (15,

Furthermore, the model (i.e. the model folder) is very small: 277 MB, with phrase-table.gz at 83 MB. The previous training with the same sampling method (only 10% of Europarl) yielded 495 MB, with phrase-table.gz at 173 MB.

Why this strange result? I suppose it has something to do with how the phrases are actually extracted by Moses. The simple explanation "phrases that are consistent with the word alignment" doesn't tell me enough, and besides, I don't fully understand what it means. Maybe a very simple example would make me understand the process (I have put my own rough attempt at one after my signature).

Yours,
Per Tunedal
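P.S. To show what I think "consistent with the word alignment" means, here is a toy Python sketch of the consistency check, under my own assumptions: the sentence pair and the alignment points below are made up, and I have ignored the handling of unaligned boundary words that the real extractor in Moses performs. Corrections are very welcome.

# Toy sketch of phrase-pair extraction: a source span and a target span
# form a phrase pair only if no word inside either span is aligned to a
# word outside the other span, and at least one alignment point links them.

def extract_phrases(src, tgt, alignment, max_len=7):
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to any source word in src[i1..i2]
            linked = [t for (s, t) in alignment if i1 <= s <= i2]
            if not linked:
                continue                      # need at least one link
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no word inside tgt[j1..j2] may be linked to
            # a source word outside src[i1..i2]
            if any(j1 <= t <= j2 and not (i1 <= s <= i2)
                   for (s, t) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]),
                          " ".join(tgt[j1:j2 + 1])))
    return pairs

# Made-up Swedish-French sentence pair and word alignment (0-based indices):
src = "jag älskar katter".split()
tgt = "j' aime les chats".split()
alignment = {(0, 0), (1, 1), (2, 2), (2, 3)}
for f, e in extract_phrases(src, tgt, alignment):
    print(f, "|||", e)

This prints, among others, "älskar katter ||| aime les chats", but never a pair like "älskar ||| aime les", because "les" is aligned to a source word ("katter") outside that source span. Is that roughly what Moses does when it extracts phrases?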
