Re: [Moses-support] tokenizer.perl weirdness with some patterns

Philipp Koehn Fri, 03 Jul 2015 10:09:11 -0700

Hi,

the proper handling of "." in tokenization is a hard problem.


The heuristics that the script uses to decide whether a period
is an end-of-sentence period, and hence should be split off
(rather than an abbreviation period, which should stay attached),
include checking whether the next word starts with an uppercase
letter. In your example, the next word is lowercase, so the script
concludes that it is an abbreviation period and does not split it off.
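
To make the lowercase check concrete, here is a rough sketch in Python.
It is not the code from tokenizer.perl: the function name
split_sentence_periods is made up for illustration, and the other
heuristics (such as the per-language abbreviation lists) are left out.
It only assumes the rule described above: a trailing period stays
attached when the next word starts with a lowercase letter.

```python
def split_sentence_periods(text):
    """Simplified illustration of the next-word-case heuristic.

    Hypothetical helper, not part of Moses. A trailing period is split
    off unless the following word starts with a lowercase letter.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Only consider words ending in a single period, e.g. "voir."
        if len(word) > 1 and word.endswith(".") and not word.endswith(".."):
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt is not None and nxt[0].islower():
                # Next word is lowercase: assume an abbreviation period.
                out.append(word)
            else:
                # End of input or non-lowercase next word: split it off.
                out.append(word[:-1] + " .")
        else:
            out.append(word)
    return " ".join(out)


print(split_sentence_periods("tu ne peux pas me voir.  blabla"))
print(split_sentence_periods("I understand your reactions. Sorry."))
```

Under these assumptions the first call leaves "voir." untouched because
"blabla" is lowercase, while the second call splits both periods.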

You may change the script in any way you want for your
own purposes. It is hard to predict what the effect of that
will be for machine translation quality in your case.

-phi


On Thu, Jul 2, 2015 at 3:48 PM, Ozan Çağlayan <[email protected]> wrote:
> Hello,
>
> $ echo "tu ne peux pas me voir.  blabla" | tokenizer.perl -l fr
> tu ne peux pas me voir. blabla
>
> $ echo -n "I don't understand your reactions. sorry." | tokenizer.perl -l en
> I don &apos;t understand your reactions. sorry .
>
> So the problem is that if a dot is followed by a space and then a
> lowercase letter, it is not split off. This happens in at least
> the French tasks of IWSLT. Is this expected? The line responsible for
> this behaviour is tokenizer.perl:330. What would I lose if I commented
> out that part for large-scale processing?
>
> Thanks.
>
> PS: I also filed an issue for this:
> https://github.com/moses-smt/mosesdecoder/issues/118
>
>
>
> --
> Ozan Çağlayan
> Research Assistant
> Galatasaray University - Computer Engineering Dept.
> http://www.ozancaglayan.com
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
