Re: [Moses-support] unknown words caused by Chinese segmentation differences between training and test data

Yaqin Sat, 19 Jan 2013 17:47:18 -0800

Hi Hieu, for some reason, I don't have the model that was used to segment
the training set. But I re-segmented the test data using the training set.
The segmentation is now more consistent, which reduces the unknown words.
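
For anyone hitting the same problem, here is a minimal sketch of one way such
a re-segmentation could be done. It is an illustration only and not necessarily
the exact procedure used here: it assumes the training side is already
whitespace-segmented, and it applies greedy forward maximum matching against
the training vocabulary. The names build_vocab and resegment are hypothetical.

```python
def build_vocab(segmented_lines):
    """Collect every token seen in whitespace-segmented training text."""
    vocab = set()
    for line in segmented_lines:
        vocab.update(line.split())
    return vocab


def resegment(sentence, vocab, max_len=None):
    """Greedy forward maximum matching against the training vocabulary.

    Assumption: the existing segmentation of `sentence` is discarded and
    rebuilt from the raw character sequence.
    """
    chars = "".join(sentence.split())  # drop the old word boundaries
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(chars):
        # Try the longest candidate first; a single character is always
        # accepted so that the loop makes progress on unseen characters.
        for j in range(min(len(chars), i + max_len), i, -1):
            if chars[i:j] in vocab or j == i + 1:
                tokens.append(chars[i:j])
                i = j
                break
    return " ".join(tokens)


vocab = build_vocab(["全球 化 是 趋势"])
print(resegment("全球化 是 趋势", vocab))  # 全球 化 是 趋势
```

Note that this collapses all whitespace, so embedded Latin-script tokens or
numbers would need separate handling, and greedy matching can pick the wrong
split when several are possible.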


Thanks Jie. I'll take a look.

Yaqin

On Sat, Jan 19, 2013 at 8:08 PM, Jie Jiang <[email protected]> wrote:
> Hi Yaqin,
>
> A source-side word lattice might help in this case; please refer to the
> related section in the following paper:
>
> Christopher Dyer, Smaranda Muresan, Philip Resnik, Generalizing Word Lattice
> Translation. In Proceedings of ACL-08: HLT (June 2008), pp. 1012-1020
>
> Best regards,
>
> Jie Jiang
> Senior Language Technology Specialist
>
> Capita Translation and Interpreting
>
> Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44
> 845 367 7000 | Tel (US): +1 (800) 579-5010
> Tel Direct: +44 (0)844 854 8984 | [email protected] | Skype ID:
> jie.jiang-capita-ti
> www.capitatranslationinterpreting.com
>
>
> 2013/1/18 Yaqin <[email protected]>
>>
>> Dear all,
>>
>> I'm using the Moses phrase-based system to translate from Chinese to English.
>>
>> I found that a lot of the unknown words in the translation results for the
>> test data are caused by segmentation differences between the training data
>> and the test data on the Chinese side.
>>
>> For example "全球化" (globalization) is segmented as one word in the test
>> data, while it's segmented into two words "全球" and "化" in the training
>> data. Thus, "全球化" is not recognized and is left untranslated.
>>
>> Does anyone have any suggestions for this problem?
>>
>> Thanks,
>> Yaqin
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>

