The default tokenizer script only has specific rules for a few languages. The fallback (English) rules may suffice for your purposes: they do the obvious thing with spaces and English punctuation, and also handle some special cases for abbreviations like "Mr." and "Mrs.".
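One way to check what the fallback rules actually did is to list the lines that differ between the raw and the tokenized file. The following is only a minimal sketch in Python, not part of Moses; the function name `changed_lines` and the `limit` parameter are made up for illustration, and it assumes both files are UTF-8 with one sentence per line in the same order.

```python
import sys


def changed_lines(raw_lines, tok_lines, limit=10):
    """Return (line_number, raw, tokenized) for lines the tokenizer altered.

    Hypothetical helper: stops after `limit` differing lines so the
    result stays small enough to read by eye.
    """
    changed = []
    for number, (raw, tok) in enumerate(zip(raw_lines, tok_lines), start=1):
        raw, tok = raw.rstrip("\n"), tok.rstrip("\n")
        if raw != tok:
            changed.append((number, raw, tok))
            if len(changed) >= limit:
                break
    return changed


# Run as: python compare.py RAW_FILE TOKENIZED_FILE
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8") as raw_file, \
         open(sys.argv[2], encoding="utf-8") as tok_file:
        for number, raw, tok in changed_lines(raw_file, tok_file):
            print(f"line {number}")
            print(f"  raw: {raw}")
            print(f"  tok: {tok}")
```

If no lines are reported, the tokenizer left the text untouched. If lines are reported, look at whether punctuation has been split off into separate tokens the way you want for Urdu.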
I'd suggest you eyeball the output and see if the result is OK for you. If not, you could try editing the tokenizer and adding any abbreviations you would like to tokenize differently, or finding and using an Urdu-specific tokenizer.

As an aside for those who can edit the web site, this seems like a good candidate for the FAQ.

- John Burger
  MITRE

On Dec 26, 2013, at 11:35, Asad A.Malik wrote:

> Hi All,
>
> I am trying to develop Urdu SMT using MOSES. I have Urdu parallel corpus and
> the 1st step in manual is to tokenize the corpus, but when I enter following
> command:
>
> ~/SMT/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ur <
> ~/SMT/corpus/training/mycorpus.ur-en.ur > ~/SMT/corpus/mycorpus.ur-en.tok.ur
>
> it gives me warning:
>
> WARNING: No known abbreviations for language 'ur', attempting fall-back to
> English version...
>
> It also generates the output file but I don't know that this output is
> tokenized or not
>
>
> Regards
>
> Asad A.Malik
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
