Re: [Moses-support] Problems with segmentation mismatch and many unknown words for Chinese translation

Gideon Wenniger Tue, 03 Jun 2014 05:34:31 -0700

Hey Vincent and Hieu,
Thanks a lot for your replies.

This falls in line with my own expectation that something in the preprocessing 
of the data could be going wrong. 
But in my case I wrote my own pipeline for preprocessing the LDC Hong Kong 
Hansards, Laws and News data, so this is not a problem of Moses itself.
I am actually still working on running some new experiments using the MultiUN 
data from http://opus.lingfil.uu.se/. 
Looking up some of the unknown words from my old test output, it turns out that 
many of them can be found in the new segmented MultiUN data. So while I have no 
results yet, I am hopeful that the segmentation mismatch problem will be 
resolved by using the MultiUN data (which is already in simplified Chinese and 
more thoroughly preprocessed than the LDC Hong Kong data, so there is less 
chance of segmentation/preprocessing errors slipping in). 
If experiments with this data do give "normal" results, then I'll be sure that 
it is the preparation of the Hong Kong Hansards data that goes wrong.
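
The manual lookup of unknown words described above can be made systematic. The 
following is only a minimal Python sketch, assuming that both the training 
corpus and the test set are already segmented into whitespace-separated tokens. 
The function names load_vocab and oov_report and the toy sentences are 
hypothetical and not taken from the pipeline discussed here.

```python
from collections import Counter


def load_vocab(lines):
    """Collect the set of whitespace-separated tokens in segmented training text."""
    vocab = set()
    for line in lines:
        vocab.update(line.split())
    return vocab


def oov_report(test_lines, vocab):
    """Return (OOV rate, Counter of unknown tokens) for segmented test text."""
    total = 0
    unknown = Counter()
    for line in test_lines:
        for token in line.split():
            total += 1
            if token not in vocab:
                unknown[token] += 1
    rate = sum(unknown.values()) / total if total else 0.0
    return rate, unknown


if __name__ == "__main__":
    # Toy data (hypothetical): the last test token was segmented differently
    # from the training side, so it never matches a training token.
    train = ["我们 支持 这 项 决议", "大会 通过 了 决议"]
    test = ["我们 支持 决议草案"]
    rate, unknown = oov_report(test, load_vocab(train))
    print(f"OOV rate: {rate:.2%}")
    for token, count in unknown.most_common(10):
        print(token, count)
```

Comparing the OOV rate of the same test set against the vocabulary of each 
training corpus would indicate whether the segmentation of one corpus is out of 
line with the test data.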


This still does not completely answer the question of which step(s) in my 
(pre)processing of the Chinese data are insufficient, but I think that 
Vincent's suggestion, that the special characters could be the cause and that 
running escape-special-chars.perl might resolve it, is a very promising one. I 
will try this out as soon as I have finished my pending experiments with the 
MultiUN data.
I will update you as soon as I know more.

Cheers.

Gideon



CC: [email protected]; [email protected]
From: [email protected]
Subject: Re: [Moses-support] Problems with segmentation mismatch and many 
unknown words for Chinese translation
Date: Tue, 3 Jun 2014 01:35:44 +0100
To: [email protected]

Hey Vincent and Gideon
Do you have any details of how it fails on the new Moses but runs on the old 
Moses? Or is it speculation? It's really important that I know so I can try to 
fix it.

Sent from my flying horse
On 30 May 2014, at 05:11, Hieu Hoang <[email protected]> wrote:

Was it due to the new version of Moses? It shouldn't be; if this is the cause, 
please tell me urgently.


On 30 May 2014 03:47, Vincent_hotmail <[email protected]> wrote:

Hi Gideon,
 I recently came across a similar problem when training Chinese-other language 
pairs. I wonder if you are using the latest version of Moses. I first use the 
Stanford segmenter, NLPIR or my own segmenter to tokenize the sentences, and 
then run 
escape-special-chars.perl < input.seg.zh > out.zh to escape some special 
characters in them. After that, the problem seems to be solved. However, I 
never came across this problem when using the old version of Moses. If you have 
not solved it yet, please try this.
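
The escaping step can be sketched in Python as follows. This is an 
approximation, not the Moses script itself: the replacement table is what 
escape-special-chars.perl is understood to apply and should be checked against 
the copy in your own Moses checkout, the whitespace handling is simplified, and 
the function name escape_line is hypothetical.

```python
# Assumed replacement table for characters that have a special meaning in
# Moses input. "&" must come first so that the "&" introduced by the later
# replacements is not escaped a second time.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),   # factor separator
    ("<", "&lt;"),     # XML markup
    (">", "&gt;"),     # XML markup
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),    # syntax non-terminal
    ("]", "&#93;"),    # syntax non-terminal
]


def escape_line(line):
    """Normalise whitespace and escape special characters in one sentence."""
    # str.split() with no argument splits on any Unicode whitespace, which
    # includes the ideographic space U+3000 common in Chinese text.
    line = " ".join(line.split())
    for raw, escaped in MOSES_ESCAPES:
        line = line.replace(raw, escaped)
    return line


if __name__ == "__main__":
    # Hypothetical segmented sentence containing a bracket and a pipe.
    print(escape_line("参见 [ 附件 ] 第 1|2 条"))
```

The same function has to be applied to the training data, the tuning data and 
the test input, otherwise the escaped and unescaped forms of a token will not 
match each other.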
 Best,
 Vincent
 May 30, 2014
----------------
Longyue WANG, Vincent
Research Assistant @ NLP2CT
Postgraduate @ University of Macau
Tel: (+853) 8397-8051
Homepage: http://nlp2ct.cis.umac.mo/~vincent/
 
_______________________________________________

Moses-support mailing list

[email protected]

http://mailman.mit.edu/mailman/listinfo/moses-support




-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu

