Hi Wenlong,

Written Chinese does not normally have spaces between the words, so you need to split it up into words somehow. This is a hard problem, but there has been a lot of research on it and there are tools available to do it for you. I pointed you at a couple, but I'm sure Google could find you more.
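Just to make the idea concrete, here is a toy sketch of dictionary-based forward maximum matching, one of the simplest segmentation strategies. This is only an illustration, not one of the tools I mentioned (those use statistical models and large lexicons), and the dictionary entries below are made up for the example:

```python
def segment(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position.

    Falls back to emitting a single character when nothing matches,
    so the whole input is always covered.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to a single character.
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words

# Toy dictionary (hypothetical entries, for illustration only).
vocab = {"中国", "人民", "银行"}
print(" ".join(segment("中国人民银行", vocab)))  # prints: 中国 人民 银行
```

The output, with words separated by spaces, is exactly the format the Moses training pipeline expects in place of tokenised text.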
And yes, word segmentation would replace the tokenisation step if Chinese is your source language.

regards
Barry

On Friday 03 September 2010 07:56, Wenlong Yang wrote:
> Hi Barry,
>
> Thanks for your information.
> I am still not sure what the 'tokenizing' and the 'detokenizing' steps
> do, or why they are needed.
> Is 'tokenizing' the same thing as segmenting?
>
> BTW, I am not familiar with Java.
> Is there any such script written in perl/python/C#/C++?
> And such a script could simply replace the default 'tokenizing' script in
> the Moses training step, right?
>
> Thanks,
> Wenlong

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
