Hi Moonloki,

In principle, you cannot compare BLEU scores across different data samples. The score can vary wildly depending on the quality and size of your training set and on how closely the test set is related to the training data. Also, especially for EN–ZH translation, your results will depend on which tokeniser and segmenter you used for ZH.
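To see why the segmenter matters, here is a minimal illustration (my own sketch, not the setup Ventzi describes): a toy sentence-level BLEU with add-one smoothing, scoring the same ZH string pair under word-level versus character-level segmentation. The example sentences are made up for illustration.

```python
# Sketch: the same hypothesis/reference pair gets different BLEU scores
# under different ZH segmentations, so scores are only comparable when
# the segmentation scheme is held fixed.
import math
from collections import Counter

def bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing and brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Word-level: the two segmentations disagree on one boundary.
ref_words = "我 喜欢 机器 翻译".split()
hyp_words = "我 喜欢 机器翻译".split()
# Character-level: the underlying strings are identical.
ref_chars = list("我喜欢机器翻译")
hyp_chars = list("我喜欢机器翻译")

print(bleu(hyp_words, ref_words))  # below 1.0: segmentation mismatch
print(bleu(hyp_chars, ref_chars))  # 1.0: identical character sequences
```

The underlying translation is the same in both cases; only the tokenisation differs, yet the scores do not agree. This is exactly why BLEU comparisons across setups with different segmenters are meaningless.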
Still, here are some results from our experience. We train on about 5.2M segments of parallel EN–ZH_HANS in-house data, drawn from documentation TMs and software UI strings across our product range. We do not use tuning, as the MT may be applied to data from different domains, that is, from different products. We use the KyTea segmenter with the lcmc-0.3.0-1.mod segmentation model for ZH, and an in-house tokeniser based on a cascade of regular expressions.

With this setup, for data similar to our main product range, we get BLEU scores of about 0.50 for EN–ZH_HANS translation. For data coming from niche products, the BLEU score drops to about 0.40.

Hope this gives you some perspective.

Cheers,
Ventzi

–––––––
Dr. Ventsislav Zhechev
Computational Linguist
Language Technologies
Localisation Services
Autodesk Development Sàrl
Neuchâtel, Switzerland
http://VentsislavZhechev.eu
tel: +41 32 723 9122
fax: +41 32 723 9399

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
