The default tokenizer script only has language-specific rules for a few 
languages. The fallback (English) rules may suffice for your purposes: they do 
the obvious thing with spaces and English punctuation, and also handle some 
special cases for abbreviations like "Mr." and "Mrs.". 

I'd suggest you eyeball the output and see if the result is OK for you.  If 
not, you could try editing the tokenizer and adding any abbreviations you would 
like to tokenize differently, or finding and using an Urdu-specific tokenizer.
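
One rough way to do that eyeballing is sketched below in Python. It is only an 
illustration, not part of Moses: the function names changed_lines and 
suspicious_tokens are made up for this example, and the set of punctuation 
marks (ASCII marks plus the Urdu full stop, Arabic comma and Arabic question 
mark) is an assumption you would adjust for your corpus. It lists the lines 
the tokenizer altered and flags tokens that still have punctuation attached.

```python
# Hypothetical helpers for inspecting tokenizer output; not part of Moses.

# Assumed punctuation set: ASCII marks plus Urdu full stop (U+06D4),
# Arabic comma (U+060C) and Arabic question mark (U+061F).
MARKS = ".,!?;:\u06D4\u060C\u061F"


def changed_lines(raw_lines, tok_lines):
    """Yield (line_number, raw, tokenized) for every line the tokenizer altered."""
    for number, (raw, tok) in enumerate(zip(raw_lines, tok_lines), start=1):
        raw, tok = raw.rstrip("\n"), tok.rstrip("\n")
        if raw != tok:
            yield number, raw, tok


def suspicious_tokens(tok_line, marks=MARKS):
    """Return tokens that begin or end with a punctuation mark but also
    contain other characters, i.e. punctuation still glued to a word."""
    return [
        token
        for token in tok_line.split()
        if len(token) > 1
        and (token[0] in marks or token[-1] in marks)
        and any(ch not in marks for ch in token)
    ]
```

To use it, pass the two files as open file objects (for instance the 
mycorpus.ur-en.ur input and the mycorpus.ur-en.tok.ur output, both opened with 
encoding="utf-8") to changed_lines, and run suspicious_tokens on each tokenized 
line. A flagged token is either an abbreviation that was deliberately kept 
whole, like "Mr.", or punctuation the fallback rules failed to split off.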

As an aside for those who can edit the web site, this seems like a good 
candidate for the FAQ.

- John Burger
  MITRE

On Dec 26, 2013, at 11:35 , Asad A.Malik wrote:

> Hi All,
> 
> I am trying to develop an Urdu SMT system using Moses. I have an Urdu 
> parallel corpus, and the first step in the manual is to tokenize the corpus, 
> but when I enter the following command:
> 
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur < 
> ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus.ur-en.tok.ur  
> 
> it gives me a warning:
> 
> WARNING: No known abbreviations for language 'ur', attempting fall-back to 
> English version...
> 
> It also generates the output file, but I don't know whether this output is 
> tokenized or not.
> 
> 
> Regards
> 
> Asad A.Malik
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
