Hello there,

If you are working with Indian languages, then this is what you need: http://www.cfilt.iitb.ac.in/static/download.html

Regards.
On Tue, Feb 17, 2015 at 10:24 AM, doc <[email protected]> wrote:

> Hello,
> I am using the tokenizer.perl script which I found on
>
> https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
>
> I have tried to make it work for Indic languages which use the same punctuation
> markers with the exception of the full-stop which is a
> । U+0964 DEVANAGARI DANDA
> My main issue is that Hindi and other languages using the character also
> use the full-stop as an abbreviation marker. How do I manage to
> keep both characters as tokenising elements? I would really appreciate it if
> someone could take some time off and propose modifications to the perl
> script to accommodate also the Devanagari danda as well as the full-stop. I
> work in C and hence the issue.
> I am appending a small sample of Hindi <raw.txt> for testing
> Many thanks for your help
> Best regards,
>
> Raymond
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

-- 
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
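
For the danda question quoted above, here is a minimal sketch of one possible approach, written in Python rather than as a patch to tokenizer.perl. It is not how the Moses script works internally. The idea is to always split off the danda (U+0964) and double danda (U+0965), and to split off a trailing full-stop only when the word is not a known abbreviation. The `ABBREVIATIONS` set and the `tokenize` function are hypothetical names made up for this illustration; a real list would need to be compiled per language, much like the nonbreaking-prefix lists that ship with Moses.

```python
DANDA = "\u0964"         # DEVANAGARI DANDA
DOUBLE_DANDA = "\u0965"  # DEVANAGARI DOUBLE DANDA
SENTENCE_MARKS = (DANDA, DOUBLE_DANDA, ".")

# Hypothetical abbreviation list, for illustration only.
# A real list would be loaded from a per-language file.
ABBREVIATIONS = {"डॉ.", "Dr."}


def tokenize(text):
    """Split on whitespace, then peel trailing sentence marks off each word.

    A word that appears in ABBREVIATIONS keeps its full-stop attached.
    """
    tokens = []
    for word in text.split():
        trailing = []
        # Remove marks from the end of the word one at a time,
        # stopping as soon as what remains is a known abbreviation.
        while word and word not in ABBREVIATIONS and word[-1] in SENTENCE_MARKS:
            trailing.insert(0, word[-1])
            word = word[:-1]
        if word:
            tokens.append(word)
        tokens.extend(trailing)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Rao came."))
    print(tokenize("डॉ. शर्मा आए।"))
```

This only handles marks at the end of a whitespace-separated word and ignores other punctuation, so it is a starting point for experimenting with raw.txt, not a replacement for the full tokenizer.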
