Hello,

$ echo "tu ne peux pas me voir. blabla" | tokenizer.perl -l fr
tu ne peux pas me voir. blabla
$ echo -n "I don't understand your reactions. sorry." | tokenizer.perl -l en
I don 't understand your reactions. sorry .

So the problem is that when a period is followed by a space and then a lowercase letter, the period is not split off as a separate token. This happens in at least the French tasks of IWSLT. Is this expected?

The line responsible for this is tokenizer.perl:330. What would I lose if I commented out that part for large-scale processing?

Thanks.

PS: I also filed an issue for this: https://github.com/moses-smt/mosesdecoder/issues/118

-- 
Ozan Çağlayan
Research Assistant
Galatasaray University - Computer Engineering Dept.
http://www.ozancaglayan.com
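To make the behaviour described above concrete, here is a minimal Python sketch of the rule as I understand it from the two outputs. It is not the code from tokenizer.perl: the function name split_final_periods is hypothetical, the rule is inferred only from the examples in this message, and it ignores everything else the real tokenizer does (apostrophes, non-breaking prefixes, and so on).

```python
def split_final_periods(text):
    """Hypothetical sketch of the observed rule, not the Moses implementation.

    A word-final period is detached as its own token unless the next
    whitespace-separated word starts with a lowercase letter.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        if len(word) > 1 and word.endswith("."):
            # Assumption: a following lowercase word blocks the split.
            next_is_lower = i + 1 < len(words) and words[i + 1][:1].islower()
            if next_is_lower:
                out.append(word)
            else:
                out.append(word[:-1] + " .")
        else:
            out.append(word)
    return " ".join(out)
```

With this sketch, the French sentence comes back unchanged because "voir." is followed by the lowercase "blabla", while in the English sentence only the final "sorry." is split, which is the pattern seen in the outputs above.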
