Hi Ventzi,

May I ask you a question: do you restrict the sentence length in your training corpus? Let's say, for example, to sentence lengths from 1 to 20.
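For context, the kind of length restriction asked about here (keeping only segment pairs where both sides fall within a token-count range, as the Moses `clean-corpus-n.perl` script does) can be sketched roughly like this; the data below is made up for illustration:

```python
# Sketch of sentence-length filtering for a parallel corpus,
# similar in spirit to Moses' clean-corpus-n.perl.
# Keeps only segment pairs where both sides have min_len..max_len tokens.

def filter_by_length(src_lines, tgt_lines, min_len=1, max_len=20):
    """Return parallel (src, tgt) pairs whose token counts fall in range."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src, tgt))
    return kept

# Toy example: the second pair is dropped because both sides
# exceed the 20-token limit.
src = ["this is a short sentence", "x " * 30]
tgt = ["a short sentence indeed", "y " * 30]
pairs = filter_by_length(src, tgt)  # keeps only the first pair
```

Note that for ZH this only makes sense after segmentation, since token counts on unsegmented Chinese text are not meaningful.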
Would the translation quality be better that way?

Best regards,
Moonloki

2012/4/13 "Венцислав Жечев (Ventsislav Zhechev)" <[email protected]>
> Hi Moonloki,
>
> You cannot in principle compare BLEU scores across different data samples.
> The score may vary wildly based on your training set quality and size and
> on how closely the test set is related to the training data. Also, especially
> for EN–ZH translation, your results will depend on which tokeniser and
> segmenter you used for ZH.
>
> Still, here are some results from our experience.
> We train on about 5.2M segments of parallel EN–ZH_HANS in-house data, from
> documentation TMs and software UI strings across our product range. We
> don't use tuning, as MT may be used for data from different domains, that
> is, from different products. We use the KyTea segmenter with the
> lcmc-0.3.0-1.mod segmentation model for ZH. We use an in-house tokeniser
> based on a cascade of regular expressions.
> With this setup, for data similar to our main product range, we get BLEU
> scores of about 0.50 for EN–ZH_HANS translation. For data coming from niche
> products, the BLEU score goes down to about 0.40.
>
> Hope this gives you a perspective.
>
> Cheers,
>
> Ventzi
>
> –––––––
> Dr. Ventsislav Zhechev
> Computational Linguist
>
> Language Technologies
> Localisation Services
> Autodesk Development Sàrl
> Neuchâtel, Switzerland
>
> http://VentsislavZhechev.eu
> tel: +41 32 723 9122
> fax: +41 32 723 9399
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
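As background to the scores quoted above: BLEU is just modified n-gram precision (usually up to 4-grams) combined with a brevity penalty. A minimal self-contained sketch for the single-reference case, not a substitute for the standard `multi-bleu.perl` shipped with Moses:

```python
# Minimal corpus-level BLEU (single reference per segment, up to 4-grams).
# Illustrative only; real evaluations should use a standard implementation
# so that scores are comparable across papers and systems.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses/references: lists of token lists, one reference each."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # candidate n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = Counter(ngrams(hyp, n))
            ref_counts = Counter(ngrams(ref, n))
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in hyp_counts.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if 0 in matches:
        return 0.0  # any zero precision zeroes the geometric mean
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(log_prec)

# A hypothesis identical to its reference scores 1.0:
score = corpus_bleu([["the", "cat", "sat", "on", "the", "mat"]],
                    [["the", "cat", "sat", "on", "the", "mat"]])  # -> 1.0
```

This also illustrates why, as the reply stresses, the ZH segmenter matters so much: the n-grams are computed over tokens, so a different segmentation changes every count and makes scores incomparable.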
