Hi Moonloki,

The token limit we use is 50 per segment, and we also restrict the token ratio to 9. This is done only to get standard GIZA++ to work. As a rule, we try to include as much of our data as we can.
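
As a rough illustration of the two limits mentioned above, a filter along these lines could be written in Python. This is only a sketch, not the actual in-house tooling: the function names `keep_pair` and `filter_corpus` are hypothetical, the input is assumed to be already tokenised and whitespace-separated, the limit of 50 is assumed to be inclusive, and the ratio is assumed to be checked in both directions (longer side divided by shorter side).

```python
def keep_pair(src_tokens, tgt_tokens, max_len=50, max_ratio=9.0):
    """Return True if a segment pair passes the length and ratio limits.

    Assumptions (not stated in the thread): the length limit is inclusive,
    empty segments are dropped, and the ratio is longer side / shorter side.
    """
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)
    if n_src == 0 or n_tgt == 0:
        return False  # an empty side cannot be aligned
    if n_src > max_len or n_tgt > max_len:
        return False  # one side exceeds the token limit
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_corpus(pairs, max_len=50, max_ratio=9.0):
    """Yield (source, target) string pairs that pass keep_pair.

    Each string is assumed to be pre-tokenised, with tokens separated by
    whitespace (for ZH this means after word segmentation).
    """
    for src, tgt in pairs:
        if keep_pair(src.split(), tgt.split(), max_len, max_ratio):
            yield src, tgt
```

For example, `list(filter_corpus([("a b c", "x y")]))` keeps the pair, while a pair with 60 tokens on one side, or with 10 tokens against 1, would be dropped. Moses also ships a `clean-corpus-n.perl` script for this kind of filtering.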
The average segment length for our data is about 12 tokens on the EN side and about 13 tokens on the ZH_HANS side.

Ventzi

–––––––
Dr. Ventsislav Zhechev
Computational Linguist

Language Technologies
Localisation Services
Autodesk Development Sàrl
Neuchâtel, Switzerland

http://VentsislavZhechev.eu
tel: +41 32 723 9122
fax: +41 32 723 9399

On 13.04.2012, at 13:02, Loki Cheng wrote:

> Hi, Ventzi
>
> May I ask you a question: do you restrict the sentence length in your
> training corpus? Let's say, for example:
> sentence length from 1 to 20
>
> Thus, the translation quality would be better(?).
>
> Best regards
> Moonloki
>
> 2012/4/13 "Венцислав Жечев (Ventsislav Zhechev)"
> <[email protected]>
> Hi Moonloki,
>
> You cannot in principle compare BLEU scores across different data samples.
> The score may vary wildly based on your training set quality and size and on
> how closely the test set is related to the training data. Also, especially for
> EN–ZH translation, your results will depend on which tokeniser and segmenter
> you used for ZH.
>
> Still, here are some results from our experience.
> We train on about 5.2M segments of parallel EN–ZH_HANS in-house data, from
> documentation TMs and software UI strings across our product range. We don't
> use tuning, as MT may be used for data from different domains, that is, from
> different products. We use the KyTea segmenter with the lcmc-0.3.0-1.mod
> segmentation model for ZH. We use an in-house tokeniser based on a cascade of
> regular expressions.
> With this setup, for data similar to our main product range, we get BLEU
> scores of about 0.50 for EN–ZH_HANS translation. For data coming from niche
> products, the BLEU score goes down to about 0.40.
>
> Hope this gives you a perspective.
>
> Cheers,
> Ventzi
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
