I think you'll find what you're looking for in manual.pdf: search for "nonbreaking_prefixes". You'll also find more information in the existing language files under scripts/share/nonbreaking_prefixes.

For example, the header of each file says:
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
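
To make the idea concrete, here is a minimal sketch in Python of how a prefix list can let the tokenizer treat both characters as tokens. It is not the logic of tokenizer.perl itself: the function name `tokenize` and the tiny prefix set are made up for illustration, and the upper-case check from the header is left out because Devanagari has no letter case. The danda is always split off as its own token, while a trailing full stop is split off only when the word before it is not a listed abbreviation.

```python
DANDA = "\u0964"  # DEVANAGARI DANDA

# Hypothetical prefix list, for illustration only. A real list would be
# loaded from a nonbreaking_prefixes language file.
NONBREAKING_PREFIXES = {"Dr", "Mr", "डॉ"}


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split whitespace-separated words, detaching sentence-final marks.

    Assumption: punctuation of interest only appears at the end of a word.
    """
    tokens = []
    for word in text.split():
        if word.endswith(DANDA) and len(word) > 1:
            # The danda is never an abbreviation marker: always detach it.
            tokens.extend([word[:-1], DANDA])
        elif word.endswith(".") and len(word) > 1 and word[:-1] not in prefixes:
            # Full stop after an ordinary word: treat it as a separate token.
            tokens.extend([word[:-1], "."])
        else:
            # Listed abbreviation (keep its full stop attached) or a plain word.
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(tokenize("डॉ. शर्मा आए।"))
    print(tokenize("Dr. Rao came."))
```

In this sketch, adding a Hindi abbreviation to the prefix set is enough to keep its full stop attached, which mirrors what the nonbreaking_prefixes files are for.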




On 02/17/2015 08:24 AM, doc wrote:
Hello,
I am using the tokenizer.perl script, which I found at
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
I have tried to make it work for Indic languages, which use the same punctuation marks, except that the full stop is a
। U+0964 DEVANAGARI DANDA
My main issue is that Hindi and other languages using this character also use the full stop as an abbreviation marker. How can I keep both characters as tokenising elements? I would really appreciate it if someone could take the time to propose modifications to the Perl script so that it handles the Devanagari danda as well as the full stop. I work in C, hence the issue.
I am attaching a small sample of Hindi <raw.txt> for testing.
Many thanks for your help.
Best regards,

Raymond




_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
