Hey Vincent and Hieu,

Thanks a lot for your replies. They match my own suspicion that something in the preprocessing of the data is going wrong. In my case I wrote my own pipeline for preprocessing the LDC Hong Kong Hansards, Laws and News data, so this is not a problem in Moses itself. I am still running new experiments using the MultiUN data from http://opus.lingfil.uu.se/. When I looked up some of the unknown words from my old test output, many of them turned out to be present in the new segmented MultiUN data. So while I have no results yet, I am hopeful that the segmentation mismatch problem will be resolved by using the MultiUN data (which is already in simplified Chinese and more thoroughly preprocessed than the LDC Hong Kong data, so there is less chance for segmentation/preprocessing errors to slip in). If experiments with this data do give "normal" results, then I'll be sure that it is the preparation of the Hong Kong Hansards data that goes wrong.
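A lookup of this kind can be sketched roughly as follows in Python. This is only an illustration under assumptions: the function names and the toy sentences are hypothetical, it is not part of Moses or of any particular pipeline, and it assumes the corpus is already segmented into whitespace-separated tokens.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def build_vocab(segmented_lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens in an already-segmented corpus."""
    vocab: Counter = Counter()
    for line in segmented_lines:
        vocab.update(line.split())
    return vocab


def split_by_coverage(unknown_words: Iterable[str],
                      vocab: Counter) -> Tuple[List[str], List[str]]:
    """Split unknown words into those present in the vocabulary and those absent."""
    found: List[str] = []
    missing: List[str] = []
    for word in unknown_words:
        (found if word in vocab else missing).append(word)
    return found, missing


if __name__ == "__main__":
    # Toy stand-ins for a segmented training corpus and a list of
    # unknown words collected from decoder output (both hypothetical).
    corpus = ["联合国 大会 通过 决议", "香港 特别 行政区 政府"]
    unknown = ["联合国", "行政区", "特别行政区"]

    vocab = build_vocab(corpus)
    found, missing = split_by_coverage(unknown, vocab)
    print("found in corpus:", found)
    print("still missing:", missing)
```

Words that end up in the "missing" list even though their characters clearly occur in the corpus (such as an unsplit compound) point to a segmentation mismatch rather than to a genuine lack of data.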
This still does not completely answer the question of which step(s) in my (pre)processing of the Chinese data are insufficient, but I think that Vincent's suggestion, that special characters could be the cause and that running escape-special-chars.perl might resolve it, is a very promising one. I will try this out as soon as I have finished my pending experiments with the MultiUN data. I will update you as soon as I know more.

Cheers,
Gideon

CC: [email protected]; [email protected]
From: [email protected]
Subject: Re: [Moses-support] Problems with segmentation mismatch and many unknown words for Chinese translation
Date: Tue, 3 Jun 2014 01:35:44 +0100
To: [email protected]

Hey Vincent and Gideon

Did you have any details of how it fails on new Moses but runs on the old Moses? Or is it speculation? It's really important that I know so I can try and fix it

Sent from my flying horse

On 30 May 2014, at 05:11, Hieu Hoang <[email protected]> wrote:

was it due to the new version of Moses? It shouldn't be, if this is the cause please tell me urgently

On 30 May 2014 03:47, Vincent_hotmail <[email protected]> wrote:

Hi Gideon,

I recently came across a similar problem when training Chinese-other language pairs. I wonder if you use the latest version of Moses. I first use the Stanford, NLPIR or my own segmenter to tokenize the sentences, and then use

escape-special-chars.perl < input.seg.zh > out.zh

to process some special chars in them. After that, the problem seems to be solved. But I never came across the same problem when using the old version of Moses. If you have not solved it, please try this.

Best,
Vincent
May 30, 2014

----------------
Longyue WANG, Vincent
Research Assistant @ NLP2CT
Postgraduate @ University of Macau
Tel: (+853) 8397-8051
Homepage: http://nlp2ct.cis.umac.mo/~vincent/

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
