Re: [Moses-support] TOKENIZER.PERL

Tom Hoar Mon, 16 Feb 2015 18:10:56 -0800

I think you'll find what you're looking for in the manual.pdf. Search for the "nonbreaking_prefixes" files. You'll also find more information in the existing scripts/share/nonbreaking_prefixes language files.


For example, the files' header says:

#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.
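
As a rough illustration of that rule applied to the danda question, here is a minimal sketch in Python. It is not the logic of tokenizer.perl itself: the function name tokenize, the two prefix sets, and their contents are all hypothetical, and the upper-case check mentioned in the header is left out for brevity. The idea is that the danda is always split off as its own token, while a trailing full stop is split off only when the word before it is not a listed nonbreaking prefix.

```python
# Hypothetical prefix lists for illustration only; the real lists are the
# language files under scripts/share/nonbreaking_prefixes.
NONBREAKING_PREFIXES = {"Dr", "Mr", "Prof", "\u0921\u0949"}  # last entry: Hindi "Dr"
# Prefixes that protect the period only when a digit follows (assumed example).
NUMERIC_ONLY_PREFIXES = {"No"}

DANDA = "\u0964"  # DEVANAGARI DANDA


def tokenize(text):
    """Split on whitespace, always detach a trailing danda, and detach a
    trailing full stop unless the word is a known nonbreaking prefix."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        next_word = words[i + 1] if i + 1 < len(words) else ""
        if word.endswith(DANDA) and len(word) > 1:
            # The danda is never an abbreviation marker, so always split it.
            tokens.extend([word[:-1], DANDA])
        elif word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            keep = stem in NONBREAKING_PREFIXES or (
                stem in NUMERIC_ONLY_PREFIXES and next_word[:1].isdigit()
            )
            if keep:
                tokens.append(word)          # abbreviation: keep the period attached
            else:
                tokens.extend([stem, "."])   # ordinary full stop: separate token
        else:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Rao came home\u0964 He left."))
```

In this sketch both characters end up as separate tokens, and only the full stop needs the prefix list, since the danda does not double as an abbreviation marker.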





On 02/17/2015 08:24 AM, doc wrote:

Hello,
I am using the tokenizer.perl script which I found on
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
I have tried to make it work for Indic languages, which use the same punctuation markers with the exception of the full-stop, which is a
। U+0964 DEVANAGARI DANDA
My main issue is that Hindi and other languages using this character also use the full-stop as an abbreviation marker. How do I manage to keep both characters as tokenising elements? I would really appreciate it if someone could take some time and propose modifications to the perl script to accommodate the Devanagari danda as well as the full-stop. I work in C, hence the issue.
I am appending a small sample of Hindi <raw.txt> for testing.
Many thanks for your help
Best regards,

Raymond




_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
