Hi, Ventzi
May I ask you a question: do you restrict the sentence length in your
training corpus? Let's say, for example, to sentences of
length 1 to 20.

Would the translation quality be better as a result?
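(For context, this kind of length restriction is what the Moses corpus-cleaning step does: its clean-corpus-n.perl script drops sentence pairs where either side falls outside a given token-count range. A minimal Python sketch of such a filter, assuming whitespace-tokenised text and the 1–20 window mentioned above; the function name and sample data are illustrative, not from Moses:)

```python
def filter_by_length(pairs, min_len=1, max_len=20):
    """Keep only pairs where BOTH sides have min_len..max_len tokens,
    mirroring what Moses' clean-corpus-n.perl does on whitespace tokens."""
    kept = []
    for src, tgt in pairs:
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
            kept.append((src, tgt))
    return kept

# Tiny demo: the second pair's source side has far more than 20 tokens,
# so only the first pair survives the filter.
pairs = [
    ("this is a short sentence", "这 是 一 个 短 句"),
    ("a " * 30 + "very long sentence", "太 长 了"),
]
print(len(filter_by_length(pairs)))  # prints 1
```

(In practice Moses also enforces a source/target length ratio to catch misaligned pairs; that check is omitted here for brevity.)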

Best regards
Moonloki
2012/4/13 "Венцислав Жечев (Ventsislav Zhechev)" <
[email protected]>

> Hi Moonloki,
>
> You cannot in principle compare BLEU scores across different data samples.
> The score may vary wildly based on your training set's quality and size and
> on how closely the test set is related to the training data. Also, results
> for EN–ZH translation especially will depend on which tokeniser and
> segmenter you used for ZH.
>
> Still, here are some results from our experience.
> We train on about 5.2M segments of parallel EN–ZH_HANS in-house data, drawn
> from documentation TMs and software UI strings across our product range. We
> don’t use tuning, as the MT system may be used for data from different
> domains, that is, from different products. For ZH we use the KyTea segmenter
> with the lcmc-0.3.0-1.mod segmentation model, together with an in-house
> tokeniser based on a cascade of regular expressions.
> With this setup, for data similar to our main product range, we get BLEU
> scores of about 0.50 for EN–ZH_HANS translation. For data coming from niche
> products, the BLEU score goes down to about 0.40.
>
>
> Hope this gives you some perspective.
>
>
> Cheers,
>
> Ventzi
>
> –––––––
> Dr. Ventsislav Zhechev
> Computational Linguist
>
> Language Technologies
> Localisation Services
> Autodesk Development Sàrl
> Neuchâtel, Switzerland
>
> http://VentsislavZhechev.eu
> tel: +41 32 723 9122
> fax: +41 32 723 9399
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
