I think you'll find what you're looking for in manual.pdf: search
for "nonbreaking_prefixes". You'll also find more information in
the existing language files under scripts/share/nonbreaking_prefixes.
For example, the files' header says:
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
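
As a rough illustration of that rule (not the actual tokenizer.perl logic), here is a minimal Python sketch. It assumes the danda never marks an abbreviation, so it is always split off, while a trailing full stop is split off only when the word before it is not a listed prefix. The function name tokenize and the hand-written prefix set are hypothetical; a real setup would read the prefixes from a language file.

```python
DANDA = "\u0964"  # DEVANAGARI DANDA


def tokenize(text, nonbreaking_prefixes):
    """Split text on whitespace, isolating the danda and sentence-final full stops.

    nonbreaking_prefixes is a set of abbreviations (without the trailing
    full stop) whose full stop must stay attached. This is an assumed,
    simplified stand-in for a nonbreaking_prefixes file.
    """
    # Assumption: the danda is never an abbreviation marker, so always isolate it.
    text = text.replace(DANDA, " " + DANDA + " ")

    tokens = []
    for word in text.split():
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            if stem in nonbreaking_prefixes:
                # Listed abbreviation: keep the full stop attached.
                tokens.append(word)
            else:
                # Otherwise treat the full stop as its own token.
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    # "डॉ" (Dr) is used here as a hypothetical prefix entry.
    print(tokenize("डॉ. शर्मा आए। वे गए.", {"डॉ"}))
```

With this input the danda and the final full stop come out as separate tokens, while "डॉ." stays in one piece because its stem is in the prefix set. The sketch ignores the "only before 0-9 numbers" special case mentioned in the header.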
On 02/17/2015 08:24 AM, doc wrote:
Hello,
I am using the tokenizer.perl script which I found on
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
I have tried to make it work for Indic languages which use the same
punctuation markers with the exception of the full-stop which is a
। U+0964 DEVANAGARI DANDA
My main issue is that Hindi and other languages using the character
also use the full-stop as an abbreviation marker. How do I manage to
keep both characters as tokenising elements? I would really appreciate
it if someone could take the time to propose modifications to the
Perl script to accommodate the Devanagari danda as well as the
full-stop. I work in C, hence the issue.
I am attaching a small sample of Hindi <raw.txt> for testing.
Many thanks for your help
Best regards,
Raymond
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support