I've found the Stanford Chinese Segmenter (http://nlp.stanford.edu/software/segmenter.shtml) to work well.
See the following paper for information on this segmenter and some perspective on the problem:

Pi-Chuan Chang, Michel Galley and Chris Manning. "Optimizing Chinese Word Segmentation for Machine Translation Performance." In ACL Third Workshop on Statistical Machine Translation, 2008.
http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf

Kevin

On Wed, Aug 17, 2011 at 10:25 PM, Tom Hoar <[email protected]> wrote:

> I'm familiar with two methods to segment Chinese. One method simply inserts
> a space between each character. The results are predictable, but
> translations are generally not as high quality as possible.
>
> The second method uses a program that identifies words as sequences of
> multiple characters (typically 1, 2 or 3) and inserts a space between them.
> I haven't worked with Chinese for a while, so I'm not sure of the latest
> advancements in Chinese word segmentation. LDC publishes a Perl script,
> http://projects.ldc.upenn.edu/Chinese/,
> http://www.ldc.upenn.edu/Projects/Chinese/ldc-cn-seg.1.2.tgz. I remember
> seeing a C++ version, but can't find it now. There's also this one on Google
> Code: http://code.google.com/p/zhseg/
>
> Maybe someone on moses-support knows of other Chinese tools.
>
> Regards,
> Tom
>
> On Thu, 18 Aug 2011 09:16:48 +0800, 蒋乾 <[email protected]> wrote:
>
> Hi,
>
> Thank you for your suggestions.
>
> I have done some tests. They showed that both English-to-Chinese and
> Chinese-to-English training would fail if I did not take any measures.
>
> Suzy and Tom gave me the useful advice to do something like segmentation.
> The further question is: how do I segment?
>
> Could anybody who has experience training on a corpus, either from English
> to Chinese or from Chinese to English, give me some ideas?
>
> Thank you very much.
>
> Regards,
> James
>
> 2011/8/17 Tom Hoar <[email protected]>
>
>> I agree with Suzy. Also, if your translation requests are not
>> segmented, it's possible that the training corpus was also not
>> segmented. Verify that your training corpus, development and test sets
>> were all segmented when you trained/tuned your translation model. If not,
>> you'll need to start from the beginning.
>>
>> Tom
>>
>> On Wed, 17 Aug 2011 19:28:17 +1000, Suzy Howlett <[email protected]>
>> wrote:
>> > Hi James,
>> >
>> > It looks like the text has not been segmented into words, so it
>> > thinks every sentence is a single word. Unless the sentences you are
>> > trying to translate are identical to some sentences in the training
>> > corpus, it will think every test sentence is an unknown word it's
>> > never seen before. You'll need to use some kind of word segmentation.
>> > Unfortunately I don't know anything about that area, so I have no
>> > useful suggestions.
>> >
>> > Best,
>> > Suzy
>> >
>> > On 17/08/11 7:13 PM, 蒋乾 wrote:
>> >> Hi all,
>> >>
>> >> When I used MT to translate from Chinese to English, I met an
>> >> unexpected problem. Could you please tell me the reason if you have
>> >> any idea about it?
>> >>
>> >> I trained on a large parallel corpus of about 2,600,000 lines on a
>> >> computer with 5GB RAM.
>> >> After that, I tried translating a small Chinese file of about 80
>> >> lines into English. Unexpectedly, it didn't work.
>> >> It did not do any translation work at all. The target file I got
>> >> was the same as the source file.
>> >>
>> >> One sample of the information shown on the screen during MT's
>> >> translation is as follows:
>> >>
>> >> "
>> >> Translating: 使用文本索引查询视图
>> >> Collecting options took 0.000 seconds
>> >> Search took 0.000 seconds
>> >> BEST TRANSLATION: 使用文本索引查询视图|UNK|UNK|UNK [1]
>> >> [total=-99.978] <<0.000, -1.000, -100.000, 0.000, 0.000, 0.000,
>> >> 0.000, 0.000, 0.000, -7.346, 0.000, 0.000, 0.000, 0.000, 0.000>>
>> >> Translation took 0.000 seconds
>> >> Finished translating
>> >> Translating: 使用文本索引查询视图关于
>> >> Collecting options took 0.000 seconds
>> >> Search took 0.000 seconds
>> >> BEST TRANSLATION: 使用文本索引查询视图关于|UNK|UNK|UNK [1]
>> >> [total=-99.978] <<0.000, -1.000, -100.000, 0.000, 0.000, 0.000,
>> >> 0.000, 0.000, 0.000, -7.346, 0.000, 0.000, 0.000, 0.000, 0.000>>
>> >> Translation took 0.000 seconds
>> >> Finished translating
>> >> "
>> >>
>> >> I would appreciate it very much if you could tell me why this
>> >> happens and how to solve it.
>> >>
>> >> Thank you very much.
>> >>
>> >> Regards,
>> >> James
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
