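Yes, this is expected. I do not know the exact reason, but my guess is that the tokenizer assumes well-written input with proper casing (e.g., "I don't understand your reactions. *S*orry."), so a period followed by a lowercase word is not treated as the end of a sentence and is left attached.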
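To make the behaviour concrete, here is a minimal Python sketch of the heuristic as described in this thread. It is not the code from tokenizer.perl: the function name `split_final_periods` is made up for illustration, and the sketch ignores everything else the real tokenizer does (apostrophes, non-breaking prefixes, numbers, and so on). It only assumes that a word-final period is split off unless the next word starts with a lowercase letter.

```python
def split_final_periods(text):
    """Toy illustration only; NOT the actual tokenizer.perl logic.

    Assumption: a trailing period is detached from its word unless the
    following word begins with a lowercase letter.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        if len(word) > 1 and word.endswith("."):
            # Look at the next word, if there is one.
            next_word = words[i + 1] if i + 1 < len(words) else ""
            if next_word[:1].islower():
                # Lowercase continuation: leave the period attached.
                out.append(word)
            else:
                # Capitalised next word or end of input: split the period off.
                out.append(word[:-1] + " .")
        else:
            out.append(word)
    return " ".join(out)


print(split_final_periods("tu ne peux pas me voir. blabla"))
print(split_final_periods("I understand your reactions. Sorry."))
```

Under that assumption the first call leaves "voir." untouched, while the second splits both periods because "Sorry" is capitalised and the last period ends the input. Lowercasing the data before tokenizing would therefore change where periods get split.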
Best wishes!
Pidong

On 2 July 2015 at 12:48, Ozan Çağlayan <[email protected]> wrote:
> Hello,
>
> $ echo "tu ne peux pas me voir. blabla" | tokenizer.perl -l fr
> tu ne peux pas me voir. blabla
>
> $ echo -n "I don't understand your reactions. sorry." | tokenizer.perl -l en
> I don 't understand your reactions. sorry .
>
> So the problem is that if a dot is followed by a space and then a
> lowercase letter, it is not tokenized. This is happening in at least
> the french tasks of IWSLT. Is this expected? The responsible line for
> this problem is tokenizer.perl:330. What should I lose if I comment
> out the responsible part for this in large scale processing?
>
> Thanks.
>
> PS: I also filed an issue for this:
> https://github.com/moses-smt/mosesdecoder/issues/118
>
> --
> Ozan Çağlayan
> Research Assistant
> Galatasaray University - Computer Engineering Dept.
> http://www.ozancaglayan.com
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
