Hi,

Inspired by the paper "Does more data always yield better translations?" (aclweb.org/anthology-new/E/E12/E12-1016.pdf), which Ken Fasano kindly linked to, I have been experimenting a great deal.
I have tested several ways of picking a good sample of sentences from the Europarl corpus, using 10% of the sentences. I thought I had found a promising method, so I picked a larger sample, 35%, and expected a much better translation. On the contrary, the translation of my test text was terrible: it was turned into garbage, completely useless.

I trained the phrase model with:

nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ~/corpora/Total1.sv-fr.clean_urval -f sv -e fr -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/lm/Total1.blm.fr:8 -external-bin-dir ~/mosesdecoder/tools -parallel -cores 4 -score-options --GoodTuring >& training.out &

The training was incredibly fast, in spite of the larger training corpus. After the line stating that moses.ini was created, I found lots of warnings of the type:

WARNING: sentence 2448049 has alignment point (15, 19) out of bounds (15,

Furthermore, the model (i.e. the model folder) is very small: 277 MB, with phrase-table.gz at 83 MB. The previous training with the same sampling method (only 10% of Europarl) yielded 495 MB, with phrase-table.gz at 173 MB.

Why this strange result? I suppose it has something to do with how the phrases are actually extracted by Moses. The simple explanation "phrases that are consistent with the word alignment" doesn't tell me enough, and besides, I don't fully understand what it means. Maybe a very simple example would make me understand the process (I have put my own rough attempt at one after my signature).

Yours,
Per Tunedal
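P.S. To show what I think "consistent with the word alignment" means, here is a toy Python sketch of the consistency check, under my own assumptions: the sentence pair and the alignment points below are made up, and I have ignored the handling of unaligned boundary words that the real extractor in Moses performs. Corrections are very welcome.

# Toy sketch of phrase-pair extraction: a source span and a target span
# form a phrase pair only if no word inside either span is aligned to a
# word outside the other span, and at least one alignment point links them.

def extract_phrases(src, tgt, alignment, max_len=7):
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target positions linked to any source word in src[i1..i2]
            linked = [t for (s, t) in alignment if i1 <= s <= i2]
            if not linked:
                continue                      # need at least one link
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no word inside tgt[j1..j2] may be linked to
            # a source word outside src[i1..i2]
            if any(j1 <= t <= j2 and not (i1 <= s <= i2)
                   for (s, t) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]),
                          " ".join(tgt[j1:j2 + 1])))
    return pairs

# Made-up Swedish-French sentence pair and word alignment (0-based indices):
src = "jag älskar katter".split()
tgt = "j' aime les chats".split()
alignment = {(0, 0), (1, 1), (2, 2), (2, 3)}
for f, e in extract_phrases(src, tgt, alignment):
    print(f, "|||", e)

This prints, among others, "älskar katter ||| aime les chats", but never a pair like "älskar ||| aime les", because "les" is aligned to a source word ("katter") outside that source span. Is that roughly what Moses does when it extracts phrases?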
