Hi Wenlong,

Written Chinese does not normally have spaces between the words, so you need to split it up into words somehow. This is a hard problem, but there has been a lot of research on it and there are tools available to do it for you. I pointed you at a couple, but I'm sure Google could find you more.
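Just to make the idea concrete, here is a toy sketch of dictionary-based forward maximum matching, one of the simplest segmentation strategies. This is only an illustration, not one of the tools I mentioned (those use statistical models and large lexicons), and the dictionary entries below are made up for the example:

```python
def segment(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position.

    Falls back to emitting a single character when nothing matches,
    so the whole input is always covered.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words

# Toy dictionary (hypothetical entries, for illustration only).
vocab = {"中国", "人民", "银行"}
print(" ".join(segment("中国人民银行", vocab)))  # prints: 中国 人民 银行
```

The output, with words separated by spaces, is exactly the format the Moses training pipeline expects in place of tokenised text.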
And yes, word segmentation would replace the tokenisation step if Chinese is your source language.

regards
Barry

On Friday 03 September 2010 07:56, Wenlong Yang wrote:
> Hi Barry,
>
> Thanks for your information.
> I am still not sure what the 'tokenizing' and the 'detokenizing' steps
> do, or why they are needed.
> Is 'tokenizing' the same thing as segmenting?
>
> BTW, I am not familiar with Java.
> Is there any such script written in perl/python/C#/C++?
> And such a script could simply replace the default 'tokenizing' script in
> the Moses training step, right?
>
> Thanks,
> Wenlong

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
