[Moses-support] tokenizer.perl weirdness with some patterns

Ozan Çağlayan Thu, 02 Jul 2015 12:51:13 -0700

Hello,

$ echo "tu ne peux pas me voir.  blabla" | tokenizer.perl -l fr
tu ne peux pas me voir. blabla


$ echo -n "I don't understand your reactions. sorry." | tokenizer.perl -l en
I don &apos;t understand your reactions. sorry .

So the problem is that if a period is followed by a space and then a
lowercase letter, it is not split off as a separate token. This happens
in at least the French tasks of IWSLT. Is this expected? The line
responsible for this behavior is tokenizer.perl:330. What would I lose
if I commented out that part for large-scale processing?
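
The rule described above can be sketched roughly as follows. This is a
simplified Python illustration, not the actual Perl code: the function
name split_final_period, the check_lowercase switch, and the tiny
NONBREAKING set are all made up for the example, and other cases the
real tokenizer handles (numbers, multi-dot abbreviations, escaping) are
left out.

```python
# Simplified sketch of the trailing-period rule; NOT the real tokenizer.perl code.
# NONBREAKING is a hypothetical stand-in for the nonbreaking-prefix list.
NONBREAKING = {"Mr", "Dr", "etc"}


def split_final_period(text, check_lowercase=True):
    """Split a trailing period off each word unless a rule says to keep it."""
    words = text.split()
    out = []
    for i, word in enumerate(words):
        if len(word) > 1 and word.endswith("."):
            stem = word[:-1]
            # The check in question: is the next word lowercase-initial?
            next_is_lower = (
                check_lowercase
                and i + 1 < len(words)
                and words[i + 1][0].islower()
            )
            if stem in NONBREAKING or next_is_lower:
                out.append(word)          # keep the period attached
            else:
                out.append(stem + " .")   # split the period off
        else:
            out.append(word)
    return " ".join(out)


print(split_final_period("your reactions. sorry."))
print(split_final_period("your reactions. sorry.", check_lowercase=False))
print(split_final_period("approx. five items.", check_lowercase=False))
```

With the lowercase check disabled, the sketch splits "reactions." as
wanted, but it also splits "approx.", which suggests that abbreviations
missing from the prefix list and followed by a lowercase word are what
might be lost.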

Thanks.

PS: I also filed an issue for this:
https://github.com/moses-smt/mosesdecoder/issues/118



-- 
Ozan Çağlayan
Research Assistant
Galatasaray University - Computer Engineering Dept.
http://www.ozancaglayan.com

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
