You will need a Chinese word segmenter to prepare the data for training and decoding. Chinese is written without spaces between words, and Moses expects whitespace-separated tokens. Several segmenters are available (listed in no particular order):

http://code.google.com/p/zhseg/
http://nlp.stanford.edu/software/segmenter.shtml
http://projects.ldc.upenn.edu/Chinese/LDC_ch.htm#cseg

I haven't tried any of them, and I believe most of them are for the Simplified Chinese script.
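To give a rough idea of what the segmentation step does, here is a minimal sketch using greedy forward maximum matching against a word list. This is not how the tools above work internally (I have not looked at them); the function name `segment`, the `max_len` parameter, and the toy lexicon are all made up for illustration.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (illustrative only).

    `lexicon` is a hypothetical set of known words; a real segmenter
    would use a large dictionary or a trained statistical model.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, assumed purely for this example.
TOY_LEXICON = {"我们", "喜欢", "机器", "翻译", "机器翻译"}

if __name__ == "__main__":
    # Moses-style input: one sentence per line, tokens separated by spaces.
    print(" ".join(segment("我们喜欢机器翻译", TOY_LEXICON)))
```

The Chinese side of the corpus, once run through a segmenter, has one sentence per line with spaces between words, which is the form the rest of the training pipeline can work with.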

On Fri, Feb 12, 2010 at 11:10 PM, nati g <[email protected]> wrote:

> Hello,
>
> Has anyone tried setting up Moses for translating English --> Chinese?
> Please share any information or scripts that can be used, other than
> those provided in the step-by-step guide.
>
> Thanks in advance.
>
> On Fri, Feb 12, 2010 at 7:15 PM, Christine de Bond <[email protected]> wrote:
>
>> You might ask the moses-list people if anyone has done English-Chinese
>> translation / alignment and got any reasonable output. They might give
>> you some more hints!
>>
>> By the way, how big is your parallel corpus?
>> Another idea might be to check whether factored translation models are
>> of any help to you (I'm thinking of alignment and reordering factors
>> here, but I'm not sure if this is appropriate for Chinese...)
>>
>> nati g wrote:
>>
>>> Hi Christine,
>>> Thank you very much for the information.
>>> I had already tried skipping these steps, but the translation quality
>>> is too bad.
>>> Unlike European languages, double-byte languages like Chinese, Korean
>>> and Japanese have a different syntax. For example, the translation of
>>> an English string of a few words may be a single character. I guess
>>> that because of these kinds of syntactic dissimilarities we are not
>>> getting a good translation model after training.
>>> Thank you very much.
>>>
>>> On Thu, Feb 11, 2010 at 7:46 PM, Christine de Bond <[email protected]> wrote:
>>>
>>>     Hi
>>>     I don't know much about Chinese, but there is no lowercase in
>>>     Chinese, right?
>>>     You can skip the lowercasing part if there are no
>>>     capital/lowercase letters in Chinese.
>>>
>>>     As for tokenizing, best is to have a look at the Perl script to
>>>     see what it's doing. You should make sure that no punctuation
>>>     (if there is any in Chinese) is concatenated with words
>>>     ( word. -> word . )
>>>     I think the Moses tokenizer script should work well for your
>>>     corpus, as long as there is no special issue with Chinese
>>>     punctuation.
>>>     (I've so far used it with Latin and Persian character sets.)
>>>
>>>     Best is to try out the tokenizer.perl script with some test
>>>     sentences to see what the script is doing to your input.
>>>
>>>     Christine
>>>
>>>     nati g wrote:
>>>
>>>         Hi,
>>>         Thank you very much for the reply.
>>>         I have concerns about the tokenizer, lowercasing and sort
>>>         scripts used while training the translation model from the
>>>         corpus.
>>>         Will these not have any effect on the language we are going
>>>         to use?
>>>
>>>         On Thu, Feb 11, 2010 at 2:43 PM, Christine de Bond
>>>         <[email protected]> wrote:
>>>
>>>             Hi,
>>>             Moses is language-independent. There is no need for
>>>             adaptation.
>>>             Best is to follow the "Step-by-Step Guide" on the Moses
>>>             website to get started.
>>>
>>>             Regards,
>>>             Christine
>>>
>>>             nati g wrote:
>>>
>>>                 Hello,
>>>                 Do we need any special scripts to build Moses for
>>>                 translating English to Chinese?
>>>                 Thanks in advance.
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support

_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
