I've found the Stanford Chinese Segmenter
(http://nlp.stanford.edu/software/segmenter.shtml) to work well.

See the following paper for information on this segmenter and some
perspective on the problem:
Pi-Chuan Chang, Michel Galley and Chris Manning. "Optimizing Chinese Word
Segmentation for Machine Translation Performance." In ACL Third Workshop on
Statistical Machine Translation, 2008.
http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf
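
For a concrete picture of the two approaches Tom describes in the quoted
message below, here is a minimal Python sketch. It is only an illustration:
the function names and the toy lexicon are made up for this example, and
greedy forward maximum matching is a simple dictionary method, not how the
Stanford segmenter works. A real setup would run a proper segmenter over the
training, tuning and test data alike.

```python
def segment_chars(sentence):
    """Baseline: put a space between every non-space character."""
    return " ".join(ch for ch in sentence if not ch.isspace())


def segment_max_match(sentence, lexicon, max_len=3):
    """Greedy forward maximum matching against a word list.

    At each position, take the longest lexicon entry (up to max_len
    characters) that starts there; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        if sentence[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return " ".join(tokens)


# Toy lexicon, invented for this example only (not from any real resource).
TOY_LEXICON = {"使用", "文本", "索引", "查询", "视图"}

line = "使用文本索引查询视图"
print(segment_chars(line))                   # one token per character
print(segment_max_match(line, TOY_LEXICON))  # one token per lexicon word
```

Either way, the decoder then sees several space-separated tokens per line
instead of one long unknown token.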

Kevin

On Wed, Aug 17, 2011 at 10:25 PM, Tom Hoar <
[email protected]> wrote:

> I'm familiar with two methods to segment Chinese. One method simply inserts
> a space between each character. The results are predictable, but
> translation quality is generally lower than it could be.
>
> The second method uses a program that identifies words as sequences of
> one or more characters (typically 1, 2 or 3) and inserts a space between
> them. I haven't worked with Chinese for a while, so I'm not sure of the
> latest advancements in Chinese word segmentation. LDC publishes a Perl
> script:
> http://projects.ldc.upenn.edu/Chinese/,
> http://www.ldc.upenn.edu/Projects/Chinese/ldc-cn-seg.1.2.tgz. I remember
> seeing a C++ version, but can't find it now. There's also this one on Google
> Code: http://code.google.com/p/zhseg/
>
> Maybe someone on moses-support knows of other Chinese tools.
>
> Regards,
> Tom
>
>
>
> On Thu, 18 Aug 2011 09:16:48 +0800, 蒋乾 <[email protected]> wrote:
>
> Hi,
>
> Thank you for your suggestions.
>
> I have done some tests. They showed that both English-to-Chinese and
> Chinese-to-English training would fail if I did not take any such measures.
>
> Suzy and Tom gave me the useful advice to do something like segmentation.
> The further question is: how do I do the segmentation?
>
> Could anybody who has experience training on a corpus, either from English
> to Chinese or from Chinese to English, give me some ideas?
>
> Thank you very much.
>
> Regards,
> James
>
> 2011/8/17 Tom Hoar <[email protected]>
>
>>  I agree with Suzy. Also, if your translation requests are not
>>  segmented, it's possible that the training corpus was also not
>>  segmented. Verify that your training corpus, development and test sets were
>>  all segmented when you trained/tuned your translation model. If not,
>>  you'll need to start from the beginning.
>>
>>  Tom
>>
>>
>>  On Wed, 17 Aug 2011 19:28:17 +1000, Suzy Howlett <[email protected]>
>>  wrote:
>> > Hi James,
>> >
>> > It looks like the text has not been segmented into words, so it thinks
>> > every sentence is a single word. Unless the sentences you are trying to
>> > translate are identical to some sentences in the training corpus, it
>> > will think every test sentence is an unknown word it's never seen
>> > before. You'll need to use some kind of word segmentation. Unfortunately
>> > I don't know anything about that area, so I have no useful suggestions.
>> >
>> > Best,
>> > Suzy
>> >
>> > On 17/08/11 7:13 PM, 蒋乾 wrote:
>> >> Hi all,
>> >>
>> >> When I used MT to translate from Chinese to English, I ran into an
>> >> unexpected problem. Could you please tell me the reason if you have
>> >> any idea about it?
>> >>
>> >> I trained on a large parallel corpus of about 2,600,000 lines on a
>> >> computer with 5GB RAM.
>> >> After that, I tried translating a small Chinese file of about 80 lines
>> >> into English. Unexpectedly, it didn't work.
>> >> It did not translate anything at all. The target file I got was the
>> >> same as the source file.
>> >>
>> >> One sample of the information shown on the screen during MT's
>> >> translation is as follows:
>> >>
>> >>     "
>> >>     Translating: 使用文本索引查询视图
>> >>     Collecting options took 0.000 seconds
>> >>     Search took 0.000 seconds
>> >>     BEST TRANSLATION: 使用文本索引查询视图|UNK|UNK|UNK [1]
>> >>     [total=-99.978] <<0.000, -1.000, -100.000, 0.000, 0.000, 0.000,
>> >>     0.000, 0.000, 0.000, -7.346, 0.000, 0.000, 0.000, 0.000, 0.000>>
>> >>     Translation took 0.000 seconds
>> >>     Finished translating
>> >>     Translating: 使用文本索引查询视图关于
>> >>     Collecting options took 0.000 seconds
>> >>     Search took 0.000 seconds
>> >>     BEST TRANSLATION: 使用文本索引查询视图关于|UNK|UNK|UNK [1]
>> >>     [total=-99.978] <<0.000, -1.000, -100.000, 0.000, 0.000, 0.000,
>> >>     0.000, 0.000, 0.000, -7.346, 0.000, 0.000, 0.000, 0.000, 0.000>>
>> >>     Translation took 0.000 seconds
>> >>     Finished translating
>> >>     "
>> >>
>> >> I would appreciate it if you could tell me why this happens and how
>> >> to solve it.
>> >>
>> >> Thank you very much.
>> >>
>> >> Regards,
>> >> James
>> >>
>> >>
>> >> _______________________________________________
>> >> Moses-support mailing list
>> >> [email protected]
>> >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
