Hi, the proper handling of "." in tokenization is a hard problem.
The heuristics that the script uses to decide whether a period is an end-of-sentence period, and hence should be separated (rather than an abbreviation period, which should stay attached), include checking whether the next word starts with an uppercase letter. In your example, the next word is lowercase, so the script concludes that the period is an abbreviation period and does not split it off.

You may change the script in any way you want for your own purposes. It is hard to predict what effect that will have on machine translation quality in your case.

-phi

On Thu, Jul 2, 2015 at 3:48 PM, Ozan Çağlayan <[email protected]> wrote:
> Hello,
>
> $ echo "tu ne peux pas me voir. blabla" | tokenizer.perl -l fr
> tu ne peux pas me voir. blabla
>
> $ echo -n "I don't understand your reactions. sorry." | tokenizer.perl -l en
> I don 't understand your reactions. sorry .
>
> So the problem is that if a dot is followed by a space and then a
> lowercase letter, it is not tokenized. This is happening in at least
> the french tasks of IWSLT. Is this expected? The responsible line for
> this problem is tokenizer.perl:330. What should I lose if I comment
> out the responsible part for this in large scale processing?
>
> Thanks.
>
> PS: I also filed an issue for this:
> https://github.com/moses-smt/mosesdecoder/issues/118
>
> --
> Ozan Çağlayan
> Research Assistant
> Galatasaray University - Computer Engineering Dept.
> http://www.ozancaglayan.com
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
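
For illustration only, the uppercase-lookahead heuristic described in the reply can be sketched in Python roughly as follows. This is not the code from tokenizer.perl: the function name `split_periods` is hypothetical, and the sketch deliberately ignores abbreviation lists, apostrophe handling, and every other rule the real script applies. It only assumes that a trailing period stays attached when the next word starts with a lowercase letter, and is split off otherwise or at the end of the input.

```python
import re


def split_periods(text: str) -> str:
    """Split off a trailing period unless the next word starts lowercase.

    Simplified, hypothetical sketch of the heuristic discussed above;
    it is not the logic of tokenizer.perl itself.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Only consider words ending in exactly one trailing period.
        match = re.fullmatch(r"(.*[^.])\.", word)
        if match is None:
            out.append(word)
            continue
        stem = match.group(1)
        is_last = i == len(words) - 1
        next_is_lower = not is_last and words[i + 1][0].islower()
        if is_last or not next_is_lower:
            # Treated as an end-of-sentence period: separate it.
            out.append(stem + " .")
        else:
            # Next word is lowercase: treated as an abbreviation period.
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    print(split_periods("tu ne peux pas me voir. blabla"))
    print(split_periods("He left. She stayed."))
```

With this sketch, the period in "voir. blabla" stays attached because "blabla" is lowercase, which mirrors the behaviour reported in the question; dropping the lowercase check would split every such period, including those after genuine abbreviations.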
