Re: [Moses-support] tokenizer.perl weirdness with some patterns

Pidong Wang Sun, 05 Jul 2015 01:07:17 -0700

Yes, this is expected. I do not know the exact reason, but my guess is
that the tokenizer assumes well-written input with proper casing, where
a new sentence starts with an uppercase letter (e.g., "I don't
understand your reactions. *S*orry.").
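
To make the behaviour concrete, here is a minimal Python sketch of the
rule as described in the quoted message below: a word-final period stays
attached when the next word starts with a lowercase letter. This is not
the actual tokenizer.perl code. The function name `split_periods` is
hypothetical, and the sketch ignores everything else the tokenizer does
(non-breaking prefix lists, apostrophe escaping, and so on).

```python
import re


def split_periods(text):
    """Detach a word-final period unless the next word starts lowercase.

    Simplified illustration only; not the Moses implementation.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        m = re.fullmatch(r"(\S+)\.", word)
        if m is None:
            # Word does not end in a period: leave it untouched.
            out.append(word)
            continue
        is_last = i == len(words) - 1
        next_is_lower = (not is_last) and words[i + 1][0].islower()
        if next_is_lower:
            # Assumed heuristic: a lowercase continuation suggests the
            # period is not a sentence boundary, so keep it attached.
            out.append(word)
        else:
            out.append(m.group(1) + " .")
    return " ".join(out)


print(split_periods("I don't understand your reactions. sorry."))
print(split_periods("I don't understand your reactions. Sorry."))
```

With this sketch, the lowercase "sorry" leaves "reactions." in one
piece, while the capitalised "Sorry" causes the period to be split off.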


Best wishes!
Pidong

On 2 July 2015 at 12:48, Ozan Çağlayan <[email protected]> wrote:

> Hello,
>
> $ echo "tu ne peux pas me voir.  blabla" | tokenizer.perl -l fr
> tu ne peux pas me voir. blabla
>
> $ echo -n "I don't understand your reactions. sorry." | tokenizer.perl -l en
> I don &apos;t understand your reactions. sorry .
>
> So the problem is that if a period is followed by a space and then a
> lowercase letter, the period is not split off as a separate token.
> This happens in at least the French tasks of IWSLT. Is this expected?
> The line responsible for this is tokenizer.perl:330. What would I
> lose if I commented out that part for large-scale processing?
>
> Thanks.
>
> PS: I also filed an issue for this:
> https://github.com/moses-smt/mosesdecoder/issues/118
>
>
>
> --
> Ozan Çağlayan
> Research Assistant
> Galatasaray University - Computer Engineering Dept.
> http://www.ozancaglayan.com
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

