When you compiled Moses, it created a scripts folder. In it you'll
find the subfolder "scripts/tokenizer/nonbreaking_prefixes". The files
in this folder all share the same base name, each with a two-letter
language code as its extension. These files hold language-specific
rules for how the tokenizer and detokenizer work.

Does anyone know of a better resource than reading the existing files
to learn how these files work?

Tom 

On Wed, 30 May 2012 18:22:52 +0530, tharaka weheragoda wrote:

Thank you very much for your answer. But I'm new to this field and I
don't know how to create nonbreaking_prefixes files. Is there any
particular way of doing this? Could you explain a bit more?

On Wed, May 30, 2012 at 6:13 PM, Tom Hoar wrote:

Build your own nonbreaking_prefixes file. Name it with the language
code you want to use as its extension ("si" in your case) and save it
in the nonbreaking_prefixes subfolder under the Moses scripts/tokenizer
folder. The existing files are commented with instructions to help you.
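
A rough sketch of that step in Python, for illustration only. It assumes the
files follow the pattern seen in the existing ones: a name of the form
"nonbreaking_prefix.<code>", one abbreviation per line without the trailing
period, "#" for comment lines, and a "#NUMERIC_ONLY#" marker for prefixes that
only count when a number follows. Check those assumptions against the files
shipped with your Moses build. The helper names and the entries "Dr", "Mr" and
"No" are placeholders, not real Sinhala abbreviations.

```python
import os
import tempfile


def render_prefix_file(prefixes, numeric_only=()):
    """Build the file text: a comment line, then one prefix per line."""
    lines = ["# Placeholder entries; replace with real abbreviations."]
    lines.extend(prefixes)
    # Prefixes that are nonbreaking only when followed by a number.
    lines.extend(f"{prefix} #NUMERIC_ONLY#" for prefix in numeric_only)
    return "\n".join(lines) + "\n"


def write_prefix_file(directory, lang, prefixes, numeric_only=()):
    """Write nonbreaking_prefix.<lang> into directory and return its path."""
    path = os.path.join(directory, f"nonbreaking_prefix.{lang}")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_prefix_file(prefixes, numeric_only))
    return path


if __name__ == "__main__":
    # A temporary folder stands in for the nonbreaking_prefixes subfolder.
    target = tempfile.mkdtemp()
    print(write_prefix_file(target, "si", ["Dr", "Mr"], ["No"]))
```

To try it against a real install, point the directory argument at your own
nonbreaking_prefixes subfolder and pass the abbreviations you have collected.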

Tom 


On Wed, 30 May 2012 17:37:19 +0530, tharaka weheragoda  wrote:  

Hi everybody,

When I try to tokenize my Sinhala dataset, it gives me a warning
message like this:
 "WARNING: No known abbreviations for language 'si', attempting fall-back to English version..."

Also, some of my letters come out slightly changed. Is there any way to
tokenize Sinhala data with this tokenizer.perl?

I'm looking forward to your help.

Thanks in advance!
Tharaka


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support