When you compiled Moses, it created a scripts folder. In there, you'll find the subfolder "scripts/tokenizer/nonbreaking_prefixes". The files in this folder all have the same name with a 2-letter language code extension. These files contain language-specific rules for how the tokenizer and detokenizer work.
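In case it helps, here is a rough Python sketch of what such a file can look like and how its entries can be read back. It is not taken from the Moses scripts. It assumes the layout described in the comments of the files shipped with Moses: one abbreviation per line, lines starting with "#" are comments, and an entry followed by "#NUMERIC_ONLY#" only counts as non-breaking when a number comes next. The file name "nonbreaking_prefix.<lang>", the helper names, and the Latin placeholder entries are assumptions for illustration only; check the existing files in your folder and replace the entries with real Sinhala abbreviations.

```python
import tempfile
from pathlib import Path

NUMERIC_MARK = "#NUMERIC_ONLY#"  # marker assumed from the shipped files' comments


def write_prefix_file(folder, lang, prefixes, numeric_only=()):
    """Write nonbreaking_prefix.<lang> into folder (hypothetical helper)."""
    path = Path(folder) / f"nonbreaking_prefix.{lang}"
    lines = [
        "# Abbreviations after which a period does NOT end the sentence.",
        "# One entry per line; lines starting with '#' are comments.",
    ]
    lines += list(prefixes)
    # Entries that are non-breaking only when followed by a number.
    lines += [f"{p} {NUMERIC_MARK}" for p in numeric_only]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def read_prefix_file(path):
    """Return {prefix: 1 or 2}; 2 marks a numeric-only entry."""
    table = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith(NUMERIC_MARK):
            table[line[: -len(NUMERIC_MARK)].strip()] = 2
        else:
            table[line] = 1
    return table


if __name__ == "__main__":
    # A temp folder stands in for your nonbreaking_prefixes folder.
    with tempfile.TemporaryDirectory() as folder:
        # "Dr", "Mr", "No" are placeholders, not Sinhala abbreviations.
        path = write_prefix_file(folder, "si", ["Dr", "Mr"], ["No"])
        print(path.name)
        print(read_prefix_file(path))
```

Once a file like this sits in the nonbreaking_prefixes folder with the "si" extension, the fall-back warning for 'si' should no longer apply, since the tokenizer then has a file for that language code.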
Anyone, is there a better resource than reading the existing files to learn how these files work?

Tom

On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda wrote:

Thank you very much for your answer. But I'm new to this field and I don't know how to create nonbreaking_prefix files. Is there a particular way of doing this? Can you explain a bit more?

On Wed, May 30, 2012 at 6:13 PM, Tom Hoar wrote:

Build your own nonbreaking_prefixes file. Name it with the extension you want to use and save it in the nonbreaking_prefixes subfolder under the Moses scripts/tokenizer folder. The existing files are commented with instructions to help you.

Tom

On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda wrote:

Hi everybody,

When I try to tokenize my Sinhala dataset, it gives me a warning message like this:

"WARNING: No known abbreviations for language 'si', attempting fall-back to English version..."

My letters have also changed a bit. Is there any way to tokenize Sinhala data with this tokenizer.perl? I look forward to your help.

Thanks in advance!

Tharaka
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
