Hi Moonloki,

The token limit we use is 50 per segment, and we also restrict the token 
ratio to 9. We only do this to get standard GIZA++ to work. As a rule, we try 
to include as much of our data as we can.
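
A minimal sketch of such a length and ratio filter follows. It is an illustration under stated assumptions, not the actual in-house pipeline: the function names are hypothetical, both sides are assumed to be already tokenised and whitespace-separated, the limit of 50 is assumed to be inclusive and applied to both sides, and the ratio is assumed to be longer side over shorter side.

```python
def keep_pair(src_tokens, tgt_tokens, max_len=50, max_ratio=9.0):
    """Return True if a segment pair passes the length and ratio limits.

    Assumptions (not confirmed by the thread): the length limit is
    inclusive and applies to both sides; the ratio is longer/shorter.
    """
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)
    # Empty segments cannot be aligned, so drop them.
    if n_src == 0 or n_tgt == 0:
        return False
    # Drop pairs where either side exceeds the token limit.
    if n_src > max_len or n_tgt > max_len:
        return False
    # Drop pairs whose lengths differ by more than the allowed ratio.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_corpus(pairs, max_len=50, max_ratio=9.0):
    """Yield (source, target) string pairs that pass keep_pair.

    `pairs` is any iterable of (source_line, target_line) strings whose
    tokens are separated by whitespace.
    """
    for src, tgt in pairs:
        if keep_pair(src.split(), tgt.split(), max_len, max_ratio):
            yield src, tgt
```

For example, `list(filter_corpus(zip(en_lines, zh_lines)))` would keep only the pairs that satisfy both limits, where `en_lines` and `zh_lines` are hypothetical lists of tokenised segments.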

The average segment length for our data is about 12 tokens on the EN side and 
about 13 tokens on the ZH_HANS side.


Ventzi

–––––––
Dr. Ventsislav Zhechev
Computational Linguist

Language Technologies
Localisation Services
Autodesk Development Sàrl
Neuchâtel, Switzerland

http://VentsislavZhechev.eu
tel: +41 32 723 9122
fax: +41 32 723 9399


On 13.04.2012, at 13:02, Loki Cheng wrote:

> Hi, Ventzi
> May I ask you a question: do you restrict the sentence length in your 
> training corpus? Let's say, for example:
> sentence length from 1 to 20
> 
> That way, the translation quality would be better(?).
> 
> Best regards
> Moonloki
> 2012/4/13 "Венцислав Жечев (Ventsislav Zhechev)" 
> <[email protected]>
> Hi Moonloki,
> 
> You cannot in principle compare BLEU scores across different data samples. 
> The score may vary wildly based on your training set quality and size and on 
> how closely the test set is related to the training data. Also, especially 
> for EN–ZH translation, your results will depend on which tokeniser and 
> segmenter you used for ZH.
> 
> Still, here are some results from our experience.
> We train on about 5.2M segments of parallel EN–ZH_HANS in-house data—from 
> documentation TMs and software UI strings across our product range. We don’t 
> use tuning, as MT may be used for data from different domains, that is, from 
> different products. We use the KyTea segmenter with the lcmc-0.3.0-1.mod 
> segmentation model for ZH. We use an in-house tokeniser based on a cascade of 
> regular expressions.
> With this setup, for data similar to our main product range, we get BLEU 
> scores of about 0.50 for EN–ZH_HANS translation. For data coming from niche 
> products, the BLEU score goes down to about 0.40.
> 
> 
> Hope this gives you a perspective.
> 
> 
> Cheers,
> 
> Ventzi
> 

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
