Hello there,
If you are working with Indian languages, then this is what you need:
http://www.cfilt.iitb.ac.in/static/download.html
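
In case it helps while you look at those tools, here is a minimal sketch in Python of one way to keep both the danda and the full stop as token boundaries. It is not the logic of tokenizer.perl and it is not taken from the CFILT tools; the function name tokenize and the tiny ABBREVIATIONS set are assumptions made up for illustration, and only whitespace, the danda, the double danda and a trailing full stop are handled.

```python
import re

# U+0964 DEVANAGARI DANDA and U+0965 DEVANAGARI DOUBLE DANDA.
DANDA_PATTERN = re.compile("([\u0964\u0965])")

# Hypothetical abbreviation list, for illustration only; a real list
# would be much longer and language specific.
ABBREVIATIONS = {
    "\u0921\u0949",  # Devanagari abbreviation for "Doctor"
    "Dr",
}


def tokenize(text, abbreviations=ABBREVIATIONS):
    """Split text into tokens, separating dandas and sentence-final full stops.

    A trailing full stop stays attached when the word before it is a known
    abbreviation; otherwise it becomes a token of its own.
    """
    tokens = []
    for chunk in text.split():
        # A danda is never an abbreviation marker, so always split it off.
        spaced = DANDA_PATTERN.sub(r" \1 ", chunk)
        for piece in spaced.split():
            stem = piece[:-1]
            if piece.endswith(".") and stem and stem not in abbreviations:
                tokens.extend([stem, "."])
            else:
                tokens.append(piece)
    return tokens
```

With this sketch, a full stop after a listed abbreviation stays attached to it, while a danda or a sentence-final full stop comes out as a separate token. Commas, quotes and other punctuation are left untouched here and would need their own rules.
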
Regards.

On Tue, Feb 17, 2015 at 10:24 AM, doc <[email protected]> wrote:

> Hello,
> I am using the tokenizer.perl script which I found on
>
> https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
>
> I have tried to make it work for Indic languages which use the same punctuation
> markers with the exception of the full-stop which is a
> । U+0964 DEVANAGARI DANDA
> My main issue is that Hindi and other languages using this character also
> use the full-stop as an abbreviation marker. How do I manage to
> keep both characters as tokenising elements? I would really appreciate it if
> someone could take some time and propose modifications to the Perl
> script to accommodate the Devanagari danda as well as the full-stop. I
> work in C, hence the issue.
> I am appending a small sample of Hindi <raw.txt> for testing.
> Many thanks for your help.
> Best regards,
>
> Raymond
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014